In [1]:
import pandas as pd
from ope.methods import inverse_propensity_scoring

### Example: using IPS to evaluate a new fraud policy offline

#### 1 - Assume we have a fraud model in production that blocks transactions if P(fraud) > 0.05

Let's build some sample logs from that policy running in production. Note that the production logs need some basic exploration (e.g. epsilon-greedy with ε = 0.1); that is, about 10% of the time the policy takes the action the model would not have chosen. Rewards represent the revenue gained from allowing the transaction. A negative reward indicates that the transaction was fraud and resulted in a chargeback.

In [2]:
logs_df = pd.DataFrame([
    {"context": {"p_fraud": 0.08}, "action": "blocked", "action_prob": 0.90, "reward": 0},
    {"context": {"p_fraud": 0.03}, "action": "allowed", "action_prob": 0.90, "reward": 20},
    {"context": {"p_fraud": 0.01}, "action": "allowed", "action_prob": 0.90, "reward": 10},    
    {"context": {"p_fraud": 0.09}, "action": "allowed", "action_prob": 0.10, "reward": -20}, # only allowed due to exploration 
])

logs_df

row,context,action,action_prob,reward
0,{'p_fraud': 0.08},blocked,0.9,0
1,{'p_fraud': 0.03},allowed,0.9,20
2,{'p_fraud': 0.01},allowed,0.9,10
3,{'p_fraud': 0.09},allowed,0.1,-20
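
The `action_prob` column above can be reproduced by writing the logging policy as a function. This is a minimal sketch, assuming the production policy gives probability ε to the action the model would not have chosen, which is what the logged values suggest. `logging_action_probabilities` is a hypothetical helper for illustration, not part of the `ope` package.

```python
EPSILON = 0.10  # exploration rate stated in the text

def logging_action_probabilities(context, threshold=0.05, epsilon=EPSILON):
    """Hypothetical description of the production policy.

    Blocks when P(fraud) > threshold, except for an epsilon share of
    exploratory decisions (assumption inferred from the logged values).
    """
    if context["p_fraud"] > threshold:
        return {"allowed": epsilon, "blocked": 1 - epsilon}
    return {"allowed": 1 - epsilon, "blocked": epsilon}

# (context, logged action) pairs copied from the sample logs
logged = [
    ({"p_fraud": 0.08}, "blocked"),
    ({"p_fraud": 0.03}, "allowed"),
    ({"p_fraud": 0.01}, "allowed"),
    ({"p_fraud": 0.09}, "allowed"),  # exploratory action
]

# Probability the logging policy assigned to the action it actually took
propensities = [logging_action_probabilities(ctx)[action] for ctx, action in logged]
print(propensities)
```

Only the last row gets a low propensity, because allowing a transaction with `p_fraud` of 0.09 goes against the 0.05 threshold.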


#### 2 - Now let's use IPS to score a more lenient fraud model that blocks transactions only if P(fraud) > 0.10

IPS requires that we know `P(action | context)` for the new policy. The new policy is easy to describe: it uses the same exploration rate (ε = 0.10) and only raises the blocking threshold.

In [3]:
def action_probabilities(context):
    epsilon = 0.10
    if context["p_fraud"] > 0.10:
        return {"allowed": epsilon, "blocked": 1 - epsilon}    
    
    return {"allowed": 1 - epsilon, "blocked": epsilon}

We can now get the probability that the new policy takes the same action that was taken in the production logs above.

In [4]:
logs_df["new_action_prob"] = logs_df.apply(
    lambda row: action_probabilities(row["context"])[row["action"]],
    axis=1
)
logs_df

row,context,action,action_prob,reward,new_action_prob
0,{'p_fraud': 0.08},blocked,0.9,0,0.1
1,{'p_fraud': 0.03},allowed,0.9,20,0.9
2,{'p_fraud': 0.01},allowed,0.9,10,0.9
3,{'p_fraud': 0.09},allowed,0.1,-20,0.9


We see that the new policy lets through a fraud example (`row: 3`) with a much higher probability (0.9, versus 0.1 under the logging policy). This should penalize the new model in offline evaluation. We also see that for `row: 0` the new model has a 90% chance of allowing the transaction, but we have no counterfactual knowledge of whether the transaction was fraudulent, since it was blocked in production. This demonstrates one of the drawbacks of offline policy evaluation, although with more data we'd ideally see a different action taken in the same situation (due to exploration).

#### 3 - Now we will score the new model using IPS

In [5]:
inverse_propensity_scoring.evaluate(logs_df, action_probabilities, num_bootstrap_samples=100)

{'expected_reward_logging_policy': {'mean': 2.98,
  'ci_low': -11.92,
  'ci_high': 17.87},
 'expected_reward_new_policy': {'mean': -37.42,
  'ci_low': -122.51,
  'ci_high': 47.66}}

The expected reward per observation for the new policy is much worse than for the logging policy, because the new policy is far more likely to allow the fraudulent transaction in `row: 3`. We therefore wouldn't roll this policy out to an A/B test or production, and should instead test some different policies offline.
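
The point estimates above can be sanity-checked by hand. The sketch below assumes the library computes the standard IPS estimator, i.e. the mean of `reward * new_action_prob / action_prob` over the logged rows; this is an assumption about `ope`, not something confirmed here. It uses plain Python so the arithmetic is visible.

```python
# Per-row values copied from the table in step 2.
logs = [
    {"row": 0, "reward": 0,   "action_prob": 0.90, "new_action_prob": 0.10},
    {"row": 1, "reward": 20,  "action_prob": 0.90, "new_action_prob": 0.90},
    {"row": 2, "reward": 10,  "action_prob": 0.90, "new_action_prob": 0.90},
    {"row": 3, "reward": -20, "action_prob": 0.10, "new_action_prob": 0.90},
]

def importance_weight(row):
    """How much more (or less) likely the new policy is to take the logged action."""
    return row["new_action_prob"] / row["action_prob"]

def ips_estimate(rows):
    """Standard IPS estimate: mean of importance-weighted rewards."""
    return sum(importance_weight(r) * r["reward"] for r in rows) / len(rows)

def logging_estimate(rows):
    """Average observed reward, i.e. the logging policy's own performance."""
    return sum(r["reward"] for r in rows) / len(rows)

for r in logs:
    w = importance_weight(r)
    print(r["row"], round(w, 2), round(w * r["reward"], 2))

print("new policy:", round(ips_estimate(logs), 2))
print("logging policy:", round(logging_estimate(logs), 2))
```

Row 3 carries an importance weight of about 9, so its reward of -20 contributes roughly -180 to the sum and pulls the new policy's estimate to about -37.5, against 2.5 for the logging policy. The small gap to the reported means (-37.42 and 2.98) would be consistent with those being averages over bootstrap resamples.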

However, the confidence intervals around the expected rewards for our old and new policies overlap. If we want to be really certain, it might be best to gather more data to ensure the difference is signal and not noise. In this case, fortunately, we have strong reason to suspect the new policy is worse, but these confidence intervals can be important in cases where we have less prior certainty.
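
To see why the intervals are so wide with only four rows, here is a minimal sketch of a percentile bootstrap over the per-row terms. Both the percentile method and the helper `bootstrap_ci` are assumptions for illustration; the library may compute its intervals differently, so the numbers need not match the output above.

```python
import random

# Per-row terms derived from the tables above.
# New policy: reward * new_action_prob / action_prob; logging policy: raw reward.
new_policy_terms = [0 * 0.10 / 0.90, 20 * 0.90 / 0.90, 10 * 0.90 / 0.90, -20 * 0.90 / 0.10]
logging_terms = [0, 20, 10, -20]

def bootstrap_ci(values, num_samples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values` (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample the rows with replacement and record the mean of each resample
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(num_samples))
    low = means[int((alpha / 2) * num_samples)]
    high = means[int((1 - alpha / 2) * num_samples) - 1]
    return low, high

new_ci = bootstrap_ci(new_policy_terms)
log_ci = bootstrap_ci(logging_terms)
print("new policy CI:", new_ci)
print("logging policy CI:", log_ci)

# The intervals overlap when the new policy's upper bound reaches the
# logging policy's lower bound.
print("overlap:", new_ci[1] >= log_ci[0])
```

Each resample's mean depends heavily on how many times row 3 is drawn, which is what makes the new policy's interval so wide and why more logged data would help.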