In [1]:
import numpy as np
import pandas as pd

from obp.ope import ReplayMethod
from obp.policy import EpsilonGreedy
from obp.simulator import run_bandit_simulation
from obp.utils import convert_to_action_dist

from sd_bandits.obp_extensions.dataset import DeezerDataset

# Working with Deezer Data

Deezer's data is unlike ZOZO's: rather than logged feedback, it's a pretrained logistic regression model that we can use to calculate user-item click probabilities.
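As a rough sketch of what such a model computes: a logistic regression scores a user-item pair by passing the dot product of a user feature vector and a per-playlist weight vector through a sigmoid. The function name, the toy vectors, and the feature layout below are illustrative assumptions, not the `DeezerDataset` implementation.

```python
import math

def click_probability(user_features, playlist_weights):
    """Sigmoid of the dot product of user features and playlist weights."""
    score = sum(u * w for u, w in zip(user_features, playlist_weights))
    return 1.0 / (1.0 + math.exp(-score))

# Toy 3-dimensional vectors (hypothetical; the real feature arrays are much wider).
user = [0.5, -1.0, 1.0]
playlist = [0.2, 0.4, -2.0]

p_click = click_probability(user, playlist)
print(f"click probability: {p_click:.4f}")
```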

The `DeezerDataset` object loads the user and playlist features so that we can generate the data we need from them.

In [2]:
user_features_path = "../data/deezer_carousel_bandits/user_features.csv"
playlist_features_path = "../data/deezer_carousel_bandits/playlist_features.csv"

deezer_data = DeezerDataset(
    user_features_path,
    playlist_features_path,
    len_list=12,
    len_init=3,
)

From here we have three options on what we can do next:

1. Simulate logs from a uniformly random policy, then do off-policy training and evaluation of a new policy using `ReplayMethod`.
2. Simulate logs from any policy we want, then just read out the rewards (using `ReplayMethod` or equivalently `rewards.mean()`).
3. Simulate logs from a uniformly random policy, then do off-policy training and evaluation using regression-based estimators, where the regression model is simply Deezer's logistic regression.

I would argue that #3 is an unnecessarily complicated version of #2. I think #3 makes sense in ZOZO's case, where you have logged data and you're training a regression, but in our case where we have a regression and we're generating "logged data," it feels a bit bizarre and I think we should skip it. 

In this notebook, I'm only going to demonstrate #1 and #2.

## 1. Make random logs and perform off-policy learning & estimation

### Make random logs 

The first method is to use the data Deezer gives us to create data that looks like ZOZO's logs. We do this following a procedure similar to Deezer's simulation procedure:
1. Select `n_batches` of `users_per_batch` random users (with replacement).
2. For each user, select items uniformly at random.
3. Observe the rewards by calculating user-item click probabilities with Deezer's logistic regression model.
4. Optionally simulate Deezer's cascade observation model.
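
Steps 2 to 4 can be sketched for a single user session as follows. This is a minimal illustration, not `DeezerDataset`'s code: `simulate_session` is a hypothetical helper, the click probabilities are made up, and the cascade rule (the user sees the first `len_init` slots, plus any further slots up to the last click) is my assumption about how the observation model works.

```python
import random

def simulate_session(click_probs, len_list=12, len_init=3, cascade=True, rng=None):
    """Show len_list random items to one user and return the observed (item, reward) pairs."""
    if rng is None:
        rng = random.Random()
    # Step 2: choose len_list distinct items uniformly at random.
    items = rng.sample(range(len(click_probs)), len_list)
    # Step 3: draw a Bernoulli reward for each displayed item.
    rewards = [1 if rng.random() < click_probs[i] else 0 for i in items]
    # Step 4 (assumed cascade rule): observe the first len_init slots,
    # extended down to the last clicked slot if there is a click below them.
    n_observed = len_list
    if cascade:
        clicked = [pos for pos, r in enumerate(rewards) if r == 1]
        last_click = clicked[-1] + 1 if clicked else 0
        n_observed = max(len_init, last_click)
    return list(zip(items[:n_observed], rewards[:n_observed]))

rng = random.Random(1)
click_probs = [0.03] * 862  # toy: every playlist has a 3% click probability
session = simulate_session(click_probs, rng=rng)
print(len(session), session)
```

Under this rule every session contributes at least `len_init` rows and at most `len_list` rows to the log.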

In [3]:
random_deezer_feedback = deezer_data.obtain_batch_bandit_feedback(
    n_batches=100,
    users_per_batch=1000,
    cascade=True,
    seed=1,
)

print("\ncascade is enabled, so we observe at least 3 items per user session")
print("min number of actions is thus 100 batches * 1000 users * 3 items = 300,000")
print("feedback dict:")
for key, value in random_deezer_feedback.items():
    if key[0:2] != "n_":
        print(f"  {key}: {type(value)}, {value.shape}")
    else:
        print(f"  {key}: {value}")

Calculating click probabilities: 100%|██████████| 100000/100000 [00:16<00:00, 6048.32it/s]
Generating feedback: 100%|██████████| 100000/100000 [00:00<00:00, 129875.72it/s]



cascade is enabled, so we observe at least 3 items per user session
min number of actions is thus 100 batches * 1000 users * 3 items = 300,000
feedback dict:
  action: <class 'numpy.ndarray'>, (333027,)
  reward: <class 'numpy.ndarray'>, (333027,)
  position: <class 'numpy.ndarray'>, (333027,)
  context: <class 'numpy.ndarray'>, (333027, 97)
  action_context: <class 'numpy.ndarray'>, (333027, 97)
  pscore: <class 'numpy.ndarray'>, (333027,)
  n_rounds: 333027
  n_actions: 862
  users: <class 'numpy.ndarray'>, (100000,)


What's the expected reward of the uniformly random policy?

In [4]:
exp_rand_reward = round(random_deezer_feedback["reward"].mean(),4)
print(f"Expected reward for uniform random actions: {exp_rand_reward}")

Expected reward for uniform random actions: 0.027


### Off-policy learning

Now if we want to know how epsilon-greedy bandits perform, we have to use the simulator to do off-policy learning. 

This means we step through the simulated uniformly random logs, and the epsilon-greedy bandit can only observe a reward and update when the action it proposes matches the action recorded in the log.

Unfortunately, matches are very rare, especially with 12 positions and 862 playlists. We're going to set our bandit's `batch_size=1` so that it updates its parameters after every match, i.e. as often as it possibly can.
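
A back-of-the-envelope estimate shows how rare matches should be. Assuming each logged action is uniform over the 862 playlists and independent of the bandit's pick for that position, a round matches with probability 1/862. The counts below are taken from the feedback dict printed earlier; the independence assumption is mine.

```python
n_rounds = 333_027   # logged rounds, from the feedback dict above
n_actions = 862      # number of playlists

# Probability that a uniformly random logged action equals the bandit's pick.
p_match = 1 / n_actions
expected_matches = n_rounds * p_match

print(f"match probability per round: {p_match:.5f}")
print(f"expected number of matches: {expected_matches:.0f}")
```

That works out to roughly 386 expected matches, or about 0.12% of the logged rounds.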

In [5]:
e_greedy_simulated_log = EpsilonGreedy(
    n_actions=deezer_data.n_actions,
    len_list=12,
    batch_size=1,
    random_state=1,
    epsilon=0.2,
)

action_dist_from_simulated_log = run_bandit_simulation(random_deezer_feedback, e_greedy_simulated_log)

number_of_matches = e_greedy_simulated_log.n_trial
number_of_rounds = random_deezer_feedback["n_rounds"]
print(f"\nThe epsilon-greedy bandit's actions matched on only {number_of_matches} rounds, out of a possible {number_of_rounds} :(")

100%|██████████| 333027/333027 [00:14<00:00, 22511.74it/s]



The epsilon-greedy bandit's actions matched on only 350 rounds, out of a possible 333027 :(


### Off-policy evaluation

So how well did our off-policy-trained epsilon-greedy bandit do? We don't expect it to do particularly well since it could only update on a handful of matched examples. Furthermore, we expect high-variance estimates, since there are so few matched data points to work with.
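
For intuition, the replay estimate is the average reward over the rounds where the evaluated policy's action matches the logged action, and an interval can be obtained by resampling rounds with replacement. The sketch below uses a percentile bootstrap on toy data; the helper names and the data are hypothetical, and this is not obp's implementation of `ReplayMethod`.

```python
import random

def replay_estimate(rewards, matches):
    """Mean reward over matched rounds (0.0 if nothing matched)."""
    n_matched = sum(matches)
    if n_matched == 0:
        return 0.0
    return sum(r for r, m in zip(rewards, matches) if m) / n_matched

def bootstrap_interval(rewards, matches, n_boot=500, alpha=0.05, seed=1):
    """Percentile bootstrap interval for the replay estimate."""
    rng = random.Random(seed)
    n = len(rewards)
    estimates = []
    for _ in range(n_boot):
        # Resample whole rounds (reward and match flag together).
        idx = rng.choices(range(n), k=n)
        estimates.append(
            replay_estimate([rewards[i] for i in idx], [matches[i] for i in idx])
        )
    estimates.sort()
    lower = estimates[int(n_boot * alpha / 2)]
    upper = estimates[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper

# Toy log: 1,000 rounds, about 2% of which match the evaluated policy.
rng = random.Random(0)
matches = [rng.random() < 0.02 for _ in range(1000)]
rewards = [1 if rng.random() < 0.2 else 0 for _ in range(1000)]

point = replay_estimate(rewards, matches)
lower, upper = bootstrap_interval(rewards, matches)
print(f"replay estimate: {point:.4f}, 95% interval: [{lower:.4f}, {upper:.4f}]")
```

Because only the matched rounds contribute, the effective sample size is tiny and the interval is wide.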

In [6]:
replay_estimator = ReplayMethod()

off_policy_eval = replay_estimator.estimate_interval(
    reward=random_deezer_feedback["reward"],
    action=random_deezer_feedback["action"],
    position=random_deezer_feedback["position"],
    action_dist=action_dist_from_simulated_log,
    random_state=1
)

mean_eps_greedy_log_reward = np.round(off_policy_eval["mean"], 4)
eps_greedy_log_relative = np.round(off_policy_eval["mean"] / random_deezer_feedback["reward"].mean(), 2)

print(f"Expected reward for epsilon-greedy bandit trained on random logs: {mean_eps_greedy_log_reward}",
      f"(or {eps_greedy_log_relative}x random baseline)")

lo_eps_greedy_log_reward = np.round(off_policy_eval["95.0% CI (lower)"], 4)
hi_eps_greedy_log_reward = np.round(off_policy_eval["95.0% CI (upper)"], 4)
print(f"95% confidence interval is {lo_eps_greedy_log_reward}-{hi_eps_greedy_log_reward}, a super big range!")

Expected reward for epsilon-greedy bandit trained on random logs: 0.0148 (or 0.55x random baseline)
95% confidence interval is 0.0029-0.0286, a super big range!


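As we can see, the model trained this way has a point estimate well below the random baseline (0.0148 vs. 0.027), but its confidence interval is so wide that it still contains the baseline. We can't say much beyond "it probably did _worse_ than random."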

Surely there's a better way!

## 2. Do _online_ bandit learning

Since we can calculate the click probability for every possible user-item combo, there's no need to rely on a fake log: we can just query the model directly.

This time, we'll supply the `policy` argument to `obtain_batch_bandit_feedback`: now instead of uniformly random actions chosen for each user, we'll get actions chosen by the supplied bandit policy. 

Furthermore, the policy will update once per batch of randomly selected _users_ to better simulate Deezer's experiment.
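
The batch-update loop can be sketched with a small hand-rolled epsilon-greedy bandit on toy Bernoulli arms. Everything here (the `run_online` helper, the arm probabilities, single-item rounds instead of 12-slot lists) is a simplifying assumption for illustration, not what `obtain_batch_bandit_feedback` does internally.

```python
import random

def run_online(click_probs, n_batches, users_per_batch, epsilon=0.2, seed=1):
    """Simulate epsilon-greedy with its parameters frozen within each batch."""
    rng = random.Random(seed)
    n_actions = len(click_probs)
    counts = [0] * n_actions   # pulls per action, updated once per batch
    clicks = [0] * n_actions   # clicks per action, updated once per batch
    total_reward = 0
    for _ in range(n_batches):
        # Estimates only use statistics gathered up to the previous batch.
        means = [c / n if n else 0.0 for c, n in zip(clicks, counts)]
        best = max(range(n_actions), key=means.__getitem__)
        batch_log = []
        for _ in range(users_per_batch):
            explore = rng.random() < epsilon
            action = rng.randrange(n_actions) if explore else best
            reward = 1 if rng.random() < click_probs[action] else 0
            batch_log.append((action, reward))
            total_reward += reward
        # Update the bandit once, after the whole batch has been served.
        for action, reward in batch_log:
            counts[action] += 1
            clicks[action] += reward
    return total_reward / (n_batches * users_per_batch)

toy_probs = [0.02, 0.05, 0.10, 0.03]  # hypothetical click probabilities
mean_reward = run_online(toy_probs, n_batches=100, users_per_batch=100)
print(f"mean reward: {mean_reward:.4f}")
```

Every simulated round yields a reward here, since the bandit's own action is always the one that gets served.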

In [7]:
e_greedy = EpsilonGreedy(
    n_actions=deezer_data.n_actions,
    len_list=12,
    # this batch_size setting will be ignored because supplying the policy
    # to `deezer_data.obtain_batch_bandit_feedback` will manually update
    # once per batch of *users*
    batch_size=1, 
    random_state=1,
    epsilon=0.2,
)

eg_deezer_feedback = deezer_data.obtain_batch_bandit_feedback(
    policy=e_greedy,
    n_batches=100, # this is how many times our bandit will have its params updated
    users_per_batch=1000,
    cascade=True,
    seed=1
)

print("\ncascade is enabled, so we observe at least 3 items per user session")
print("min number of actions is thus 100 batches * 1000 users * 3 items = 300,000")
print("feedback dict:")
for key, value in eg_deezer_feedback.items():
    if key[0:2] != "n_" and key != "policy" and value is not None:
        print(f"  {key}: {type(value)}, {value.shape}")
    else:
        print(f"  {key}: {value}")

Simulating online learning: 100%|██████████| 100000/100000 [00:14<00:00, 6689.06it/s]



cascade is enabled, so we observe at least 3 items per user session
min number of actions is thus 100 batches * 1000 users * 3 items = 300,000
feedback dict:
  action: <class 'numpy.ndarray'>, (403200,)
  reward: <class 'numpy.ndarray'>, (403200,)
  position: <class 'numpy.ndarray'>, (403200,)
  context: <class 'numpy.ndarray'>, (403200, 97)
  action_context: <class 'numpy.ndarray'>, (403200, 97)
  pscore: None
  n_rounds: 403200
  n_actions: 862
  policy: EpsilonGreedy(n_actions=862, len_list=12, batch_size=403200, random_state=1, epsilon=0.2, policy_name='egreedy_1.0')
  selected_actions: <class 'numpy.ndarray'>, (403200, 12)
  users: <class 'numpy.ndarray'>, (100000,)


Now we've generated a dataset that contains the actions and rewards generated by an online experiment with our epsilon-greedy bandit.

Using the `ReplayMethod` here isn't strictly necessary: since we did online learning, our logs always match our actions and so we could just as easily get `mean_eps_greedy_online_reward = eg_deezer_feedback["reward"].mean()`.
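
To see why the two agree: the replay estimate averages rewards over matched rounds, and in an online log every round matches by construction, so the average runs over all rounds. A toy check with made-up rewards (the `replay_mean` helper is hypothetical, not part of obp):

```python
def replay_mean(rewards, logged_actions, policy_actions):
    """Average reward over rounds where the policy's action equals the logged one."""
    matched = [r for r, a, b in zip(rewards, logged_actions, policy_actions) if a == b]
    return sum(matched) / len(matched) if matched else 0.0

rewards = [0, 1, 0, 0, 1, 0, 0, 0]          # made-up click outcomes
logged_actions = [4, 7, 7, 2, 9, 4, 1, 7]   # actions the policy took online

# Online learning: the evaluated actions *are* the logged actions.
estimate = replay_mean(rewards, logged_actions, logged_actions)
plain_mean = sum(rewards) / len(rewards)
print(estimate, plain_mean)
```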

In [8]:
replay_estimator = ReplayMethod()

eps_greedy_estimates = replay_estimator.estimate_interval(
    reward=eg_deezer_feedback["reward"],
    action=eg_deezer_feedback["action"],
    position=eg_deezer_feedback["position"],
    action_dist=convert_to_action_dist(
        deezer_data.n_actions, eg_deezer_feedback["selected_actions"]
    ),
)

mean_eps_greedy_online_reward = np.round(eps_greedy_estimates["mean"], 4)
eps_greedy_online_relative = np.round(eps_greedy_estimates["mean"] / random_deezer_feedback["reward"].mean(), 2)

print(f"Expected reward for epsilon-greedy bandit trained online: {mean_eps_greedy_online_reward}",
      f"({eps_greedy_online_relative}x random baseline)!")

lo_eps_greedy_online_reward = np.round(eps_greedy_estimates["95.0% CI (lower)"], 4)
hi_eps_greedy_online_reward = np.round(eps_greedy_estimates["95.0% CI (upper)"], 4)
print(f"95% confidence interval is {lo_eps_greedy_online_reward}-{hi_eps_greedy_online_reward}, a much smaller range!")

Expected reward for epsilon-greedy bandit trained online: 0.0602 (2.22x random baseline)!
95% confidence interval is 0.0594-0.0609, a much smaller range!


Wow, much much better!

Note that we can get _even better_ performance by updating our parameters more often. 

We'll switch from `n_batches=100`, `users_per_batch=1000` to `n_batches=100000`, `users_per_batch=1`, so the bandit will update 100000 times instead of 100 times.

In [9]:
e_greedy = EpsilonGreedy(
    n_actions=deezer_data.n_actions,
    len_list=12,
    batch_size=1, # this will be ignored
    random_state=1,
    epsilon=0.2,
)

eg_deezer_feedback_2 = deezer_data.obtain_batch_bandit_feedback(
    policy=e_greedy,
    n_batches=100000,
    users_per_batch=1,
    cascade=True,
    seed=1
)

mean_eps_greedy_online_reward_2 = np.round(eg_deezer_feedback_2['reward'].mean(),4)
print(f"Expected reward for epsilon-greedy bandit trained online and updated every round: {mean_eps_greedy_online_reward_2}")

Simulating online learning: 100%|██████████| 100000/100000 [00:16<00:00, 6230.29it/s]


Expected reward for epsilon-greedy bandit trained online and updated every round: 0.073


An improvement!

And by the way, if you're worried that we sampled different users, don't be: the same list of users will be generated because a) we used the same seed and b) the call to generate users is
```python
        user_indices = rng.choice(
            range(len(self.user_features)),
            size=users_per_batch * n_batches,
            replace=True,
        )
```
and `1 * 100000` is equal to `1000 * 100`.

Don't believe me?

In [10]:
all(eg_deezer_feedback_2["users"] == eg_deezer_feedback["users"])

True
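
The same point in isolation: a seeded generator that draws the same total number of samples produces the same sequence, however that total is later split into batches. The sketch uses Python's standard `random` module as a stand-in for the NumPy generator in the snippet above; `sample_users` and the user count of 5000 are hypothetical.

```python
import random

def sample_users(n_users, n_batches, users_per_batch, seed):
    """Draw user indices with replacement; only the total sample size matters."""
    rng = random.Random(seed)
    return rng.choices(range(n_users), k=n_batches * users_per_batch)

users_a = sample_users(n_users=5000, n_batches=100, users_per_batch=1000, seed=1)
users_b = sample_users(n_users=5000, n_batches=100000, users_per_batch=1, seed=1)
print(users_a == users_b)
```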

## Final proposal

Since the Deezer dataset has so many positions and playlists, the likelihood of "matches" on randomly generated logs is extremely low, so any bandit trained off-policy on them gets too few updates to perform well. I propose we instead do these "online bandit learning" experiments (the #2 method) and just use the `ReplayMethod` to calculate our confidence interval of expected reward.