## Learning from logged bandit feedback

One of the most common ways that recommender systems are developed in practice involves building models trained on the historical behaviour of the running policy.  This is to be distinguished from bandit approaches such as upper confidence bound or Thompson sampling or full reinforcement learning as in these systems there is no clear separation between a learning stage and an acting stage.  In the approaches considered here, we first learn a model and then deploy a static model that does not change further.

Here we describe a simple supervised approach where we model the probability of a click conditional upon features that are created from a combination of the users' attributes and the recommendation.

Contextual bandits that use the inverse propensity score will be investigated in future versions of Reco Gym.

## The Data


In [63]:
from pylab import *
import gym, reco_gym
import pandas as pd

from reco_gym import env_1_args

env_1_args['random_seed'] = 42

env = gym.make('reco-gym-v1')
env.init_gym(env_1_args)

data = env.generate_data(100)
print(data[:25])

    v  u  r  c   ps
0   4  1 -1 -1  NaN
1  -1  1  1  0  0.1
2  -1  1  8  0  0.1
3  -1  1  8  0  0.1
4  -1  1  9  0  0.1
5  -1  1  1  0  0.1
6  -1  1  0  0  0.1
7  -1  1  7  0  0.1
8  -1  1  4  0  0.1
9  -1  1  8  0  0.1
10 -1  1  8  0  0.1
11  4  1 -1 -1  NaN
12  9  1 -1 -1  NaN
13  5  1 -1 -1  NaN
14 -1  1  3  0  0.1
15 -1  1  5  1  0.1
16  4  1 -1 -1  NaN
17  4  1 -1 -1  NaN
18  4  1 -1 -1  NaN
19  4  1 -1 -1  NaN
20  4  1 -1 -1  NaN
21  4  1 -1 -1  NaN
22 -1  1  1  0  0.1
23 -1  1  9  0  0.1
24 -1  1  5  0  0.1


## Turn the data into features

In [50]:
train = []
for ii in range(1, 1 + data.u.max()):
    counts_v, _ = histogram([v for v in array(data[data.u == ii].v) if v > -1], range = (0, env.num_products))
    sub = data[data.u == ii]
    sub = sub[sub.r != -1]

    for i, c in zip(range(env.num_products), counts_v):
        sub['c_v%d' % (i)] = c
    train.append(sub)
train = pd.concat(train)

In [51]:
print(train[:25])

    v  u  r  c   ps  c_v0  c_v1  c_v2  c_v3  c_v4  c_v5  c_v6  c_v7  c_v8  \
1  -1  1  4  0  0.1     0     0     0     0    11     1     0     0     0   
2  -1  1  1  0  0.1     0     0     0     0    11     1     0     0     0   
3  -1  1  1  0  0.1     0     0     0     0    11     1     0     0     0   
4  -1  1  7  0  0.1     0     0     0     0    11     1     0     0     0   
5  -1  1  8  0  0.1     0     0     0     0    11     1     0     0     0   
6  -1  1  0  0  0.1     0     0     0     0    11     1     0     0     0   
7  -1  1  6  0  0.1     0     0     0     0    11     1     0     0     0   
8  -1  1  2  0  0.1     0     0     0     0    11     1     0     0     0   
9  -1  1  0  0  0.1     0     0     0     0    11     1     0     0     0   
10 -1  1  4  0  0.1     0     0     0     0    11     1     0     0     0   
14 -1  1  8  0  0.1     0     0     0     0    11     1     0     0     0   
15 -1  1  4  1  0.1     0     0     0     0    11     1     0     0     0   

In [52]:
y = train.c
X = train.loc[:, ['r'] + list(train.keys()[5:])]
# 1. import
from sklearn.linear_model import LogisticRegression

# 2. instantiate model
logreg = LogisticRegression()

# 3. fit 
lr = logreg.fit(X, y)



Let's try to check how the Logistic Regression works.

In [53]:
# Check the probability of getting click for Product '3' with 7 observations for that Product.
test_X = np.zeros((1, env_1_args['num_products'] + 1))

test_X[:, 0] = 3
test_X[:, 3 + 1] = 7
test_Y = lr.predict_proba(test_X)
print("Product #3 was shown 7 times")
print("Test X: ", test_X)
print("Test Y: ", test_Y)

# Check the probability of getting click for Product '3' with 70 observations for that Product.
test_X[:, 3 + 1] = 70
test_Y = lr.predict_proba(test_X)
print("Product #3 was shown 70 times")
print("Test X: ", test_X)
print("Test Y: ", test_Y)

Product #3 was shown 7 times
Test X:  [[3. 0. 0. 0. 7. 0. 0. 0. 0. 0. 0.]]
Test Y:  [[0.9800594 0.0199406]]
Product #3 was shown 70 times
Test X:  [[ 3.  0.  0.  0. 70.  0.  0.  0.  0.  0.  0.]]
Test Y:  [[0.98669053 0.01330947]]


As you may see, the more frequently the product is shown, the higher the probability that it will be clicked.

Now, let's create a new Agent that incorporates that logic explicitly. I.e. the agent that calculates a _Probability Score_ of Click for a Product based on how many time the product was shown.

In [54]:
import numpy as np
from numpy.random import choice


# define agent class
class LoggedFeedBackLogistic:
    def __init__(self, env):
        # set environment as an attribute of agent
        self.env = env
        self.organic_views = np.zeros(self.env['num_products'])

    def train(self, observations, action, reward, done):
        for observation in observations:
            self.organic_views[observation[1]] += 1

    def offline_train(train):
        pass

    def act(self, _ = None):
        '''act method returns an action based on current observation and past
            history'''
        prob = self.organic_views / sum(self.organic_views)
        action = np.random.choice(self.env['num_products'], p = prob)
        return {
            'a': action,
            'ps': prob[action]
        }

    def reset(self):
        self.organic_views = np.zeros(self.env['num_products'])

In [55]:
ABTestNumberOfUsers = 2000

env.init_gym(env_1_args) # Reset the Environment
a_data = env.generate_data(ABTestNumberOfUsers, LoggedFeedBackLogistic(env_1_args))

In [56]:
print(a_data[:25])

    v  u  r  c   ps
0   4  1 -1 -1  NaN
1  -1  1  4  0  1.0
2  -1  1  4  0  1.0
3  -1  1  4  0  1.0
4  -1  1  4  0  1.0
5  -1  1  4  0  1.0
6  -1  1  4  0  1.0
7  -1  1  4  0  1.0
8  -1  1  4  0  1.0
9  -1  1  4  0  1.0
10 -1  1  4  0  1.0
11  4  1 -1 -1  NaN
12  9  1 -1 -1  NaN
13  5  1 -1 -1  NaN
14 -1  1  4  0  1.0
15 -1  1  4  1  0.5
16  4  1 -1 -1  NaN
17  4  1 -1 -1  NaN
18  4  1 -1 -1  NaN
19  4  1 -1 -1  NaN
20  4  1 -1 -1  NaN
21  4  1 -1 -1  NaN
22 -1  1  4  0  0.5
23 -1  1  4  0  0.8
24 -1  1  9  0  0.1


Let's create a new Agent that uses a Logistic Regression Model (the one we tested above) without explicit calculation of _Probability Scores_ and compare its behaviour with the Agent that calculates _Probability Scores_ explicitly.

In [57]:
def BuildLogisticRegressionData(
    data,
    env,
    only_with_clicks = False,
    sliding_window_depth = -1
):
    train_X = []
    train_Y = []
    
    number_of_users = data.u.max()
    number_of_products = env['num_products']

    for user_id in range(number_of_users):
        views = np.zeros((0, number_of_products))
        for _, user_datum in data[data['u'] == user_id].iterrows():
            if user_datum['c'] == -1:
                view = int(user_datum['v'])
                tmp_view = np.zeros(number_of_products)
            
                tmp_view[view] = 1
                views = np.append(tmp_view[np.newaxis, :], views, axis = 0)
            else:
                assert(user_datum['c'] != -1)
                assert(user_datum['r'] != -1)
                view = int(user_datum['r'])
                
                if views.shape[0] <= sliding_window_depth or sliding_window_depth == -1:
                    train_views = views
                else:
                    train_views = views[:sliding_window_depth, :]
                if only_with_clicks:
                    if user_datum['c'] != 0:
                        train_X.append(train_views.sum(axis = 0))
                        train_Y.append(view)
                else:
                    train_X.append(np.append(user_datum['r'], train_views.sum(axis = 0)))
                    train_Y.append(user_datum['c'])

    train_X = np.array(train_X).reshape(
        len(train_Y), 
        number_of_products if only_with_clicks else number_of_products + 1
    )
    train_Y = np.array(train_Y)

    return train_X, train_Y


# Define agent class
class ModelBasedBackLogistic:
    def __init__(self, env, model, log_based = False):
        # Set environment as an attribute of agent
        self.env = env
        self.model = model
        self.log_based = log_based
        self.number_of_products = self.env['num_products']
        self.organic_views = np.zeros((self.number_of_products, self.number_of_products + 1))
        self.organic_views[:, 0] = range(self.number_of_products)
        
    def train(self, observations, action, reward, done):
        for observation in observations:
            self.organic_views[:, observation[1] + 1] += 1

    def offline_train(train):
        pass
    
    def act(self, _ = None):
        '''Act method returns an action based on current observation and past history'''
        if self.log_based:
            prob = self.model.predict_log_proba(self.organic_views)[:, 1]
        else:
            prob = self.model.predict_proba(self.organic_views)[:, 1]
        action = np.argmax(prob)
        return {
            'a': action,
            'ps': prob[action]
        }
    
    def reset(self):
        self.organic_views = np.zeros((self.number_of_products, self.number_of_products + 1))
        self.organic_views[:, 0] = range(self.number_of_products)

At this time, we are ready to launch our Agents and compare their IPSs with the default Agent that provides a Product randomly (with _Probability Score_ ~ 1/10).

In [59]:
train_X, train_Y = BuildLogisticRegressionData(a_data, env_1_args)
print("Train X: ", train_X)
print("Train Y: ", train_Y)

model = LogisticRegression(
    solver = 'lbfgs',
    multi_class = 'multinomial',
    max_iter = 1000
).fit(train_X, train_Y)

env.init_gym(env_1_args) # Reset the Environment
b_data = env.generate_data(ABTestNumberOfUsers, ModelBasedBackLogistic(env_1_args, model))

Train X:  [[4. 0. 0. ... 0. 0. 0.]
 [4. 0. 0. ... 0. 0. 0.]
 [4. 0. 0. ... 0. 0. 0.]
 ...
 [4. 0. 0. ... 0. 0. 1.]
 [9. 0. 0. ... 0. 0. 1.]
 [4. 0. 0. ... 0. 0. 1.]]
Train Y:  [0. 0. 0. ... 0. 0. 0.]


In [60]:
print(b_data[:25])

    v  u  r  c        ps
0   4  1 -1 -1       NaN
1  -1  1  0  0  0.031206
2  -1  1  0  0  0.031206
3  -1  1  0  0  0.031206
4  -1  1  0  0  0.031206
5  -1  1  0  0  0.031206
6  -1  1  0  0  0.031206
7  -1  1  0  0  0.031206
8  -1  1  0  0  0.031206
9  -1  1  0  0  0.031206
10 -1  1  0  0  0.031206
11  4  1 -1 -1       NaN
12  9  1 -1 -1       NaN
13  5  1 -1 -1       NaN
14 -1  1  0  0  0.031206
15 -1  1  0  0  0.031136
16 -1  1  0  0  0.031136
17 -1  1  0  0  0.031136
18 -1  1  0  0  0.031136
19 -1  1  0  0  0.031136
20  4  1 -1 -1       NaN
21  4  1 -1 -1       NaN
22  4  1 -1 -1       NaN
23  4  1 -1 -1       NaN
24 -1  1  0  0  0.031136


**Note:** you shall see that _Organic_ events for cases *A* and *B* are similar. That is as it supposed to be because we are trying to test our new Agents within the same environments.

_Inversed Probability Score_ is calculated as follow:
$$ IPS = \sum_{i \in B} \frac{\pi_i^*(c|O)}{\pi_i^r(c|O)} 1_c, $$
where:
* $ B $: a set of _Bandit_ events
* $ \pi_i^*(c|O) $: a new _Policy_
* $ \pi_i^r(c|O) $: an old _Random Policy_
* $ 1_c $: one when by applying a new policy a click has been drawn
* $ O $: observetions of _Products_; now, it is a vector that contains counters how many time each _Product_ was shown

In [61]:
ABanditEvents = a_data[a_data['v'] == -1]
print("Amount of Bandit Events for Case A", len(ABanditEvents))
ABanditEventsRewards = ABanditEvents[ABanditEvents['r'] == 1]

BBanditEvents = b_data[b_data['v'] == -1]
print("Amount of Bandit Events for Case B", len(BBanditEvents))
BBanditEventsRewards = BBanditEvents[BBanditEvents['r'] == 1]

print("A Test Rewards: ", len(ABanditEventsRewards))
print("B Test Rewards: ", len(BBanditEventsRewards))

A_IPS = np.sum(ABanditEventsRewards['ps']) / float(env_1_args['num_products'])
print("A IPS vs. Random Choise: ", A_IPS)

B_IPS = np.sum(BBanditEventsRewards['ps']) / float(env_1_args['num_products'])
print("B IPS vs. Random Choise: ", B_IPS)

Amount of Bandit Events for Case A 145148
Amount of Bandit Events for Case B 150300
A Test Rewards:  1010
B Test Rewards:  0
A IPS vs. Random Choise:  43.52856073822884
B IPS vs. Random Choise:  0.0


As you might see, the _Agent_ that is based on a Logistic Regression Model is less sensitive to data.
Let's try to apply that agent but with _Log Probability_ and check the results.

In [62]:
env.init_gym(env_1_args) # Reset the Environment
c_data = env.generate_data(ABTestNumberOfUsers, ModelBasedBackLogistic(env_1_args, model, True))


CBanditEvents = c_data[c_data['v'] == -1]
print("Amount of Bandit Events for Case C", len(CBanditEvents))
CBanditEventsRewards = CBanditEvents[CBanditEvents['r'] == 1]

C_IPS = np.sum(CBanditEventsRewards['ps']) / float(env_1_args['num_products'])
print("C IPS vs. Random Choise: ", C_IPS)

Amount of Bandit Events for Case C 150300
C IPS vs. Random Choise:  0.0
