<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#(1)-Data-loading-and-preprocessing" data-toc-modified-id="(1)-Data-loading-and-preprocessing-1">(1) Data loading and preprocessing</a></span></li><li><span><a href="#(2)-Off-Policy-Learning" data-toc-modified-id="(2)-Off-Policy-Learning-2">(2) Off-Policy Learning</a></span></li><li><span><a href="#(3)-Off-Policy-Evaluation" data-toc-modified-id="(3)-Off-Policy-Evaluation-3">(3) Off-Policy Evaluation</a></span></li></ul></div>

- Open Bandit Pipeline (obp) repository: https://github.com/st-tech/zr-obp

In [1]:
from obp.dataset import OpenBanditDataset
from obp.policy import BernoulliTS
from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting as IPW

from pathlib import Path

# (1) Data loading and preprocessing

In [3]:
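# Directory that holds the Open Bandit Dataset files (here, the working directory)
base_path = Path('.')

# Load the log collected by the Random behavior policy on the "all" campaign
dataset = OpenBanditDataset(behavior_policy='random', campaign='all', data_path=base_path)
# Preprocess the raw log into a dict of arrays (context, action, reward, pscore, ...)
bandit_feedback = dataset.obtain_batch_bandit_feedback()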

In [11]:
bandit_feedback['context'].shape

(1374327, 26)

# (2) Off-Policy Learning

In [12]:
evaluation_policy = BernoulliTS(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    is_zozotown_prior=True,  # use the prior of the BernoulliTS policy run in ZOZOTOWN production
    campaign="all",
    random_state=12345
)
# Approximate the action choice probabilities of BernoulliTS by Monte Carlo simulation
action_dist = evaluation_policy.compute_batch_action_dist(
    n_sim=100000, n_rounds=bandit_feedback["n_rounds"]
)
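
The cell above asks the policy to simulate its own action choices. The idea can be sketched in plain NumPy as follows. This is only an illustration of the Monte Carlo approach, not obp's implementation; the function name `simulate_ts_action_dist` and the Beta parameters are hypothetical.

```python
import numpy as np


def simulate_ts_action_dist(alpha, beta, len_list, n_sim, seed=0):
    """Monte Carlo estimate of P(action a is shown at position k) under Bernoulli TS."""
    rng = np.random.default_rng(seed)
    n_actions = len(alpha)
    counts = np.zeros((n_actions, len_list))
    positions = np.arange(len_list)
    for _ in range(n_sim):
        # Draw one plausible click rate per action from its Beta distribution.
        theta = rng.beta(alpha, beta)
        # Rank actions by the sampled rate and keep the top `len_list` of them.
        top = np.argsort(-theta)[:len_list]
        # Record which action landed in which position.
        counts[top, positions] += 1
    return counts / n_sim


# Hypothetical Beta parameters for 4 actions and 2 positions (illustrative values only).
alpha = np.array([1.0, 2.0, 5.0, 1.0])
beta = np.array([10.0, 10.0, 10.0, 10.0])
dist = simulate_ts_action_dist(alpha, beta, len_list=2, n_sim=2000)
print(dist.shape)        # (n_actions, len_list)
print(dist.sum(axis=0))  # each position's probabilities sum to about 1
```

Because the policy is not updated during the simulation, the same distribution applies to every round, so it can be repeated `n_rounds` times to match the logged data.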

# (3) Off-Policy Evaluation

In [14]:
# Estimate the value of BernoulliTS from the Random log with inverse probability weighting
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)

# estimated policy value of BernoulliTS relative to the observed mean reward of Random
relative_policy_value_of_bernoulli_ts = estimated_policy_value['ipw'] / bandit_feedback['reward'].mean()
print(relative_policy_value_of_bernoulli_ts)

`estimated_rewards_by_reg_model` is not given; model dependent estimators such as DM or DR cannot be used.


1.35292399328859
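
To make the number above less opaque, here is a minimal NumPy sketch of what an inverse probability weighting estimate computes: logged rewards are reweighted by the ratio of the evaluation policy's probability to the behavior policy's propensity, then averaged. It assumes `action_dist` has shape `(n_rounds, n_actions, len_list)` and that the log provides `action`, `position`, `reward`, and `pscore` arrays. The function name `ipw_policy_value` and the tiny synthetic log are hypothetical and are not taken from obp.

```python
import numpy as np


def ipw_policy_value(reward, action, position, pscore, action_dist):
    """Average of logged rewards weighted by pi_e(a | x) / pi_b(a | x)."""
    n_rounds = reward.shape[0]
    # Probability the evaluation policy gives to the logged action at the logged position.
    pi_e = action_dist[np.arange(n_rounds), action, position]
    # Importance weight: evaluation-policy probability over behavior-policy propensity.
    weights = pi_e / pscore
    return float(np.mean(weights * reward))


# Synthetic log of 4 rounds, 2 actions, 1 position (illustrative values only).
reward = np.array([1, 0, 0, 1])
action = np.array([0, 1, 0, 1])
position = np.array([0, 0, 0, 0])
pscore = np.full(4, 0.5)  # the behavior policy picks each action with probability 0.5

# The evaluation policy picks action 0 with probability 0.8 and action 1 with 0.2.
action_dist = np.tile(np.array([[0.8], [0.2]]), (4, 1, 1))

value = ipw_policy_value(reward, action, position, pscore, action_dist)
relative_value = value / reward.mean()
print(value, relative_value)
```

A relative value above 1, as in the notebook output, suggests that the evaluation policy would have earned more reward than the behavior policy on the same traffic, subject to the variance of the estimator.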
