# Demo of Open Bandit Pipeline

![](https://github.com/st-tech/zr-obp/blob/master/images/overview.png?raw=true)

**[Official documentation for OBP](https://zr-obp.readthedocs.io/en/latest/)**

## Quick Reference:

**OBP: Open Bandit Pipeline** - this software library

**OBD: Open Bandit Dataset** - the dataset supplied with it

**OPE: off-policy evaluation** - the process of determining how a policy _other than the one that was really run_ would have performed

## Dataset Loader

The first part of the Open Bandit Pipeline (OBP) is the dataset loader. For the Open Bandit Dataset (OBD), the loader is `obp.dataset.OpenBanditDataset` ([docs](https://zr-obp.readthedocs.io/en/latest/_autosummary/obp.dataset.real.html#obp.dataset.real.OpenBanditDataset)).

As with many classes in the OBP, the dataset modules are implemented with [dataclasses](https://docs.python.org/3.7/library/dataclasses.html).

The dataset module inherits from `obp.dataset.base.BaseRealBanditDataset` ([docs](https://zr-obp.readthedocs.io/en/latest/_autosummary/obp.dataset.base.html#module-obp.dataset.base)) and should implement three methods:
- `load_raw_data()`: Load an on-disk representation of the dataset into the module. Used during initialization.
- `pre_process()`: Perform any preprocessing needed to transform the raw data representation into a final representation.
- `obtain_batch_bandit_feedback()`: Return a dictionary containing (at least) keys: `["action","position","reward","pscore","context","n_rounds"]`

It is also helpful if the dataset module exposes a property `len_list`, which is how many items the bandit shows the user at a time. Often the answer is 1, though in the case of OBD it's 3.
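To make the three-method contract concrete, here is a minimal sketch of a dataset module built as a dataclass. `ToyBanditDataset` and its synthetic data are hypothetical and deliberately do not inherit from any OBP class, so this only illustrates the shape of the interface described above, not OBP's own implementation.

```python
from dataclasses import dataclass, field
import random


@dataclass
class ToyBanditDataset:
    """Hypothetical stand-in for a real dataset module (not part of OBP)."""

    n_actions: int = 4
    n_rounds: int = 100
    len_list: int = 1  # items shown to the user per round
    random_state: int = 0
    raw: list = field(default_factory=list, init=False, repr=False)

    def __post_init__(self):
        # Load, then preprocess, as part of initialization.
        self.load_raw_data()
        self.pre_process()

    def load_raw_data(self):
        # A real loader would read from disk; here we synthesize uniform-random logs.
        rng = random.Random(self.random_state)
        self.raw = [
            {"action": rng.randrange(self.n_actions), "reward": int(rng.random() < 0.1)}
            for _ in range(self.n_rounds)
        ]

    def pre_process(self):
        # Two dummy context features per round, purely illustrative.
        self.context = [[1.0, float(i % 2)] for i in range(self.n_rounds)]

    def obtain_batch_bandit_feedback(self):
        return {
            "n_rounds": self.n_rounds,
            "n_actions": self.n_actions,
            "action": [row["action"] for row in self.raw],
            "position": [0] * self.n_rounds,
            "reward": [row["reward"] for row in self.raw],
            "pscore": [1.0 / self.n_actions] * self.n_rounds,
            "context": self.context,
        }


toy = ToyBanditDataset()
toy_feedback = toy.obtain_batch_bandit_feedback()
print(sorted(toy_feedback))
```

Because the synthetic logging policy is uniform random, every `pscore` is simply `1 / n_actions`.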

In [62]:
from pathlib import Path
from obp.dataset import OpenBanditDataset

DATASET = "./data/obd_full"
# DATASET = "./data/obd"

dataset = OpenBanditDataset(
    data_path=Path(DATASET),
    campaign="all",
    behavior_policy="random"
)


In [63]:
# the OBD load_raw_data doesn't need to be manually called but it's used internally

dataset.load_raw_data

<bound method OpenBanditDataset.load_raw_data of OpenBanditDataset(behavior_policy='random', campaign='all', data_path=PosixPath('data/obd_full/random/all'), dataset_name='obd')>

In [64]:
# to see the minimal preprocessing done by OBD, see source:
# https://zr-obp.readthedocs.io/en/latest/_modules/obp/dataset/real.html#OpenBanditDataset.pre_process

dataset.pre_process

<bound method OpenBanditDataset.pre_process of OpenBanditDataset(behavior_policy='random', campaign='all', data_path=PosixPath('data/obd_full/random/all'), dataset_name='obd')>

In [65]:
feedback = dataset.obtain_batch_bandit_feedback()
print("feedback dict:")
for key, value in feedback.items():
    print(f"  {key}: {type(value)}")

feedback dict:
  n_rounds: <class 'int'>
  n_actions: <class 'int'>
  action: <class 'numpy.ndarray'>
  position: <class 'numpy.ndarray'>
  reward: <class 'numpy.ndarray'>
  reward_test: <class 'numpy.ndarray'>
  pscore: <class 'numpy.ndarray'>
  context: <class 'numpy.ndarray'>
  action_context: <class 'numpy.ndarray'>


In [66]:
dataset.len_list

3

## Policy and Simulator

The policy object defines a counterfactual bandit policy, i.e. a bandit strategy that differs from the one you actually ran.

Policies are typically initialized with `n_actions` (how many arms there are), `len_list` (how many arms can be shown at a time), and `batch_size` (how often the parameters are updated).

The task of a policy object can be seen through what methods it implements:
- `initialize()`: set policy parameter starting values
- `select_action()`: decide which action to take (i.e. which arm to pull) (can be multiple actions if len_list > 1)
- `update_params(action, reward)`: update the policy parameters based on the action chosen and the reward received
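The three methods above can be sketched with a small context-free epsilon-greedy policy. `ToyEpsilonGreedy` is a hypothetical class written for illustration; it follows the interface described here but is not OBP's `EpsilonGreedy`, and details such as how ties are broken are assumptions.

```python
import random


class ToyEpsilonGreedy:
    """Hypothetical context-free epsilon-greedy policy (not OBP's implementation)."""

    def __init__(self, n_actions, len_list=1, batch_size=1, epsilon=0.2, random_state=0):
        self.n_actions = n_actions
        self.len_list = len_list
        self.batch_size = batch_size
        self.epsilon = epsilon
        self.random_state = random_state
        self.initialize()

    def initialize(self):
        # Reset every learned quantity to its starting value.
        self.rng = random.Random(self.random_state)
        self.n_trial = 0
        self.action_counts = [0] * self.n_actions
        self.reward_sums = [0.0] * self.n_actions
        self._pending = []  # feedback waiting for the next batch update

    def select_action(self):
        # Explore with probability epsilon, otherwise pick the best-looking arms.
        if self.rng.random() < self.epsilon:
            return self.rng.sample(range(self.n_actions), self.len_list)
        means = [s / c if c else 0.0 for s, c in zip(self.reward_sums, self.action_counts)]
        ranked = sorted(range(self.n_actions), key=lambda a: means[a], reverse=True)
        return ranked[: self.len_list]

    def update_params(self, action, reward):
        # Buffer feedback and fold it into the parameters every batch_size rounds.
        self.n_trial += 1
        self._pending.append((action, reward))
        if self.n_trial % self.batch_size == 0:
            for a, r in self._pending:
                self.action_counts[a] += 1
                self.reward_sums[a] += r
            self._pending.clear()


policy = ToyEpsilonGreedy(n_actions=5, len_list=3, batch_size=2)
chosen = policy.select_action()
policy.update_params(action=chosen[0], reward=1)
policy.update_params(action=chosen[0], reward=0)
```

With `batch_size=2`, the first call to `update_params` only buffers the feedback; the counts change after the second call.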

Policies are initialized with a few parameters, depending on whether they are contextual or context-free. Information on those parameters can be found [here](https://zr-obp.readthedocs.io/en/latest/_autosummary/obp.policy.base.html).

**Note:** While many of Zozo's tutorials show the `BernoulliTS` policy being used via its `compute_batch_action_dist` function, this is not how most policies are run. It only works because `BernoulliTS` (like `Random`) is a fully randomized policy.

Most policies are instead run using the `obp.simulator.simulator.run_bandit_simulation(bandit_feedback, policy)` function ([docs](https://zr-obp.readthedocs.io/en/latest/_autosummary/obp.simulator.simulator.html#module-obp.simulator.simulator)).

### Opportunity: Better Batching

The policies can update their parameters every `batch_size` rounds, but if we have access to timestamps, it might be more valuable to update parameters every _day_, to better simulate how actual training would likely look.
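As a rough sketch of what day-based batching could look like, the helper below groups round indices by calendar day. `daily_batches` and the example timestamps are hypothetical, and the sketch assumes the log is already sorted chronologically; wiring this into the simulator is not shown.

```python
from datetime import datetime
from itertools import groupby


def daily_batches(timestamps):
    """Yield (day, [round indices]) for chronologically sorted timestamps."""
    indexed = enumerate(timestamps)
    for day, group in groupby(indexed, key=lambda pair: pair[1].date()):
        yield day, [i for i, _ in group]


# Hypothetical log: five rounds spread over three days.
timestamps = [
    datetime(2019, 11, 24, 9, 0),
    datetime(2019, 11, 24, 18, 30),
    datetime(2019, 11, 25, 8, 15),
    datetime(2019, 11, 26, 7, 45),
    datetime(2019, 11, 26, 23, 59),
]

for day, rounds in daily_batches(timestamps):
    # A policy would act on every round in `rounds` with frozen parameters,
    # then apply all of that day's updates at once.
    print(day, rounds)
```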

In [67]:
from obp.policy import EpsilonGreedy

eps_greedy = EpsilonGreedy(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    batch_size=1,  # update parameters after every round
    random_state=0,
    epsilon=0.2
)

In [68]:
from obp.simulator.simulator import run_bandit_simulation

actions = run_bandit_simulation(bandit_feedback=feedback, policy=eps_greedy)

print(f"\nShape of returned actions array: {actions.shape}\n")
print(f"# data points: {dataset.n_rounds}")
print(f"# actions ('arms'): {dataset.n_actions}")
print(f"# positions: {dataset.len_list}")

100%|██████████| 1374327/1374327 [00:34<00:00, 39334.03it/s]



Shape of returned actions array: (1374327, 80, 3)

# data points: 1374327
# actions ('arms'): 80
# positions: 3


## Off-Policy Evaluation

The off-policy evaluator (or OPE) is responsible for determining how our chosen policy would have performed. Running the simulator on the policy gives us the actions it would have taken (or some probabilities), but now we need to know what kind of rewards those actions would lead to.

This is easier said than done. One way to do it is to go through the log data and only use the rewards of cases when the logged action matches the action your new policy would've taken. This could easily be a bad estimate if the choices of the original policy look nothing like the choices the new policy would make (though if your original policy was uniform random, that's helpful). 
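The matching idea described above can be sketched in a few lines. `replay_estimate` is a hypothetical helper for a single-position, deterministic new policy; it is meant to show the principle, not to reproduce OBP's `ReplayMethod`.

```python
def replay_estimate(logged_actions, logged_rewards, new_policy_actions):
    """Mean logged reward over rounds where the new policy agrees with the log."""
    matched = [
        r
        for a_log, r, a_new in zip(logged_actions, logged_rewards, new_policy_actions)
        if a_log == a_new
    ]
    # With no matching rounds there is nothing to average; report 0.0 here.
    return sum(matched) / len(matched) if matched else 0.0


logged_actions = [0, 1, 2, 1, 0, 2]
logged_rewards = [0, 1, 0, 0, 1, 1]
new_policy_actions = [0, 1, 1, 1, 2, 2]  # what the new policy would have chosen

print(replay_estimate(logged_actions, logged_rewards, new_policy_actions))  # 0.5
```

Only four of the six rounds match, so the estimate rests on those four rewards alone; the fewer the matches, the noisier the estimate.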

Methods like inverse propensity weighting try to compensate for how likely each action was to appear in the logged data. When the logged data came from a uniform random policy, though, every action has the same propensity, so the reweighting has nothing to correct.
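A minimal sketch of inverse propensity weighting follows, assuming we know, for each round, the logging policy's probability of the logged action and the new policy's probability of that same action. `ipw_estimate` and the numbers are hypothetical.

```python
def ipw_estimate(rewards, logging_pscores, new_policy_probs):
    """Average of rewards reweighted by pi_new(a|x) / pi_logging(a|x)."""
    weighted = [
        r * p_new / p_log
        for r, p_log, p_new in zip(rewards, logging_pscores, new_policy_probs)
    ]
    return sum(weighted) / len(weighted)


rewards = [1, 0, 1, 0]
# Probability the new policy assigns to the action that was actually logged.
new_policy_probs = [0.5, 0.5, 0.25, 0.25]

# Non-uniform logging policy: rarely logged actions get up-weighted.
skewed = ipw_estimate(rewards, [0.8, 0.8, 0.2, 0.2], new_policy_probs)

# Uniform logging over 4 actions: every weight is just 4 * pi_new.
uniform = ipw_estimate(rewards, [0.25] * 4, new_policy_probs)

print(skewed, uniform)
```

In the uniform case the denominator is the same constant in every round, so the weights differ only through the new policy's own probabilities.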

Other methods try to learn a regression model so that they can predict rewards for any user-action pair, though these methods are heavily reliant on that model being accurate.
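The regression-based idea can be illustrated with the simplest possible "model": the mean logged reward per action, ignoring context entirely. Both helpers below are hypothetical, and a real setup would use a proper regression model over user-action features.

```python
from collections import defaultdict


def fit_mean_reward_model(actions, rewards):
    """Toy 'regression': estimated reward of an action = its mean logged reward."""
    totals, counts = defaultdict(float), defaultdict(int)
    for a, r in zip(actions, rewards):
        totals[a] += r
        counts[a] += 1
    return {a: totals[a] / counts[a] for a in counts}


def direct_method_estimate(model, new_policy_actions, default=0.0):
    """Average the model's predicted reward for the actions the new policy picks."""
    preds = [model.get(a, default) for a in new_policy_actions]
    return sum(preds) / len(preds)


logged_actions = [0, 0, 1, 1, 1, 2]
logged_rewards = [1, 0, 1, 1, 0, 0]
model = fit_mean_reward_model(logged_actions, logged_rewards)

print(direct_method_estimate(model, new_policy_actions=[1, 1, 0, 1, 1, 1]))
```

Every round contributes to the estimate, matched or not, which is why the result is only as good as the model's predictions.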

An off-policy estimator implements two methods:
- `estimate_policy_value()` which gets an average reward per round
- `estimate_interval()` which gets a confidence interval of rewards via bootstrap
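For intuition about the bootstrap step, here is a sketch of a percentile bootstrap interval for a mean. `bootstrap_interval` is a hypothetical helper showing the general technique; its parameter names and defaults are assumptions, not OBP's signature.

```python
import random


def bootstrap_interval(values, n_bootstrap=1000, alpha=0.05, random_state=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(random_state)
    n = len(values)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_bootstrap))
    lower_idx = int(n_bootstrap * alpha / 2)
    upper_idx = n_bootstrap - 1 - lower_idx
    return {
        "mean": sum(means) / n_bootstrap,
        "lower": means[lower_idx],
        "upper": means[upper_idx],
    }


# Hypothetical per-round rewards, e.g. the matched rewards from a replay pass.
per_round = [1] * 5 + [0] * 95
interval = bootstrap_interval(per_round)
print(interval)
```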

OPE methods which rely on regression to model rewards need to be given those rewards at initialization time. OBP provides a wrapper `obp.ope.regression_model` ([docs](https://zr-obp.readthedocs.io/en/latest/_autosummary/obp.ope.regression_model.html)) for using an off-the-shelf regression model for that piece of the task.

### Opportunity: Better evaluation than mean reward

Maybe we don't think mean reward is a particularly interesting metric. A method which does heavy exploration at the beginning will likely get low rewards at the beginning, but might end up getting much better. It could be worth comparing graphs of rolling average rewards to see how well and how fast different bandit policies learn over the course of the logged data.
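A rolling average is easy to sketch. In the example below, two hypothetical reward streams have the same overall mean, but the rolling view shows that one of them is still improving. `rolling_mean` and both streams are made up for illustration.

```python
from collections import deque


def rolling_mean(rewards, window):
    """Mean of the most recent `window` rewards at each round (shorter at the start)."""
    recent, total, out = deque(), 0.0, []
    for r in rewards:
        recent.append(r)
        total += r
        if len(recent) > window:
            total -= recent.popleft()
        out.append(total / len(recent))
    return out


# Hypothetical reward streams for two policies over the same ten rounds.
early_explorer = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
steady = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Both streams average 0.5 overall, but their final rolling means differ.
print(rolling_mean(early_explorer, window=4)[-1])  # 1.0
print(rolling_mean(steady, window=4)[-1])  # 0.5
```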

In [69]:
feedback.keys()

dict_keys(['n_rounds', 'n_actions', 'action', 'position', 'reward', 'reward_test', 'pscore', 'context', 'action_context'])

In [72]:
from obp.ope.estimators import ReplayMethod

replay_evaluator = ReplayMethod()

results = replay_evaluator.estimate_interval(
    reward=feedback["reward"],
    action=feedback["action"],
    position=feedback["position"],
    action_dist=actions,
)

In [78]:
import numpy as np

print(f"mean={np.round(results['mean'],4)}")
print(f"lower={np.round(results['95.0% CI (lower)'],4)}")
print(f"upper={np.round(results['95.0% CI (upper)'],4)}")

mean=0.0047
lower=0.0037
upper=0.0057


In [82]:
ground_truth_mean = feedback["reward"].mean()
print(f"ground truth value = {np.round(ground_truth_mean,4)}")

ground truth value = 0.0035


In [85]:
print(f"relative improvement: {np.round(results['mean'] / ground_truth_mean,4)}")

relative improvement: 1.3422
