Skip to content
Estimators to perform off-policy evaluation
Branch: master
Clone or download

Latest commit

Fetching latest commit…
Cannot retrieve the latest commit at this time.


Type Name Latest commit message Commit time
Failed to load latest commit information. Create Jan 16, 2019 Add confidence interval for ips (#3) Feb 21, 2020 Confidenceintervals (#2) Feb 13, 2020 with IPS and SNIPS from dsjson data file Jan 17, 2019

Estimators Library

In contextual bandits, a learning algorithm repeatedly observes a context, takes an action, and observes a reward for the chosen action. An example is content personalization: the context describes a user, actions are candidate stories, and the reward measures how much the user liked the recommended story. In essence, the algorithm is a policy that picks the best action given a context.

Given different policies, the metric of interest is their reward. One way to measure the reward is to deploy such policy online and let it choose actions (for example, recommend stories to users). However, such online evaluation can be costly for two reasons: It exposes users to an untested, experimental policy; and it doesn't scale to evaluating multiple target policies.

The alternative is off-policy evaluation: Given data logs collected by using a logging policy, off-policy evaluation can estimate the expected rewards for different target policies and provide confidence intervals around such estimates.

This repo collects estimators to perform such off-policy evaluation.

You can’t perform that action at this time.