## Delta Method Analysis Tutorial
This notebook shows how the DeltaMethodAnalysis class applies the Delta Method to a simple ratio metric. In this example, the cluster column is at the user level.
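
As a rough illustration of the idea, the Delta Method approximates the variance of a ratio of two per-cluster means with a first-order Taylor expansion. The sketch below is a minimal version of that textbook formula, not necessarily what DeltaMethodAnalysis does internally; the function name `delta_method_ratio_variance` and the toy per-user totals are assumptions made for this example.

```python
import numpy as np


def delta_method_ratio_variance(numerator, denominator):
    """Approximate the variance of sum(numerator) / sum(denominator).

    Both inputs hold one aggregated value per cluster (e.g. per user).
    Hypothetical helper, written only to illustrate the formula.
    """
    y = np.asarray(numerator, dtype=float)
    x = np.asarray(denominator, dtype=float)
    n = len(y)
    mean_y, mean_x = y.mean(), x.mean()
    var_y = y.var(ddof=1)
    var_x = x.var(ddof=1)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]
    # First-order Taylor expansion of mean_y / mean_x around the cluster means
    return (
        var_y / mean_x**2
        - 2 * mean_y * cov_xy / mean_x**3
        + mean_y**2 * var_x / mean_x**4
    ) / n


# Hypothetical per-user totals: conversions (numerator) and sessions (denominator)
conversions = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
sessions = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

ratio = conversions.sum() / sessions.sum()
variance = delta_method_ratio_variance(conversions, sessions)
print(f"ratio = {ratio:.4f}, approximate variance = {variance:.6f}")
```

When every cluster has the same denominator, the covariance and denominator-variance terms vanish and the expression reduces to the usual variance of a sample mean.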

We also compare the Delta Method with and without CUPED, which is used as a way to reduce variance.
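
For intuition, CUPED subtracts from the experiment-period metric the part that a pre-experiment covariate already explains. The sketch below applies the standard adjustment to a plain per-user metric. It assumes a single covariate, and the helper `cuped_adjust` and the simulated data are hypothetical. Combining CUPED with the Delta Method for a ratio metric, as DeltaMethodAnalysis does, involves more than this.

```python
import numpy as np


def cuped_adjust(post_metric, pre_metric):
    """Return the CUPED-adjusted metric for one pre-experiment covariate.

    Hypothetical helper: theta is the regression slope of post on pre.
    """
    y = np.asarray(post_metric, dtype=float)
    x = np.asarray(pre_metric, dtype=float)
    theta = np.cov(x, y, ddof=1)[0, 1] / x.var(ddof=1)
    # Centering the covariate keeps the mean of the metric unchanged
    return y - theta * (x - x.mean())


# Simulated per-user values: the post-period metric is correlated with the pre-period one
rng = np.random.default_rng(0)
pre = rng.normal(0.3, 0.15, size=1_000)
post = pre + rng.normal(0.0, 0.05, size=1_000)

adjusted = cuped_adjust(post, pre)
print(f"variance before: {post.var(ddof=1):.5f}")
print(f"variance after:  {adjusted.var(ddof=1):.5f}")
```

The adjustment leaves the mean untouched and shrinks the variance more the stronger the correlation between the two periods.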


In [1]:
import numpy as np
import pandas as pd


from cluster_experiments.experiment_analysis import DeltaMethodAnalysis
from scipy.stats import norm

In [2]:
def generate_ratio_metric_data(
    N,
    num_users=2000,
    user_sample_mean=0.3,
    user_standard_error=0.15,
    treatment_effect=0.1,
) -> pd.DataFrame:

    user_sessions = np.random.choice(num_users, N)
    user_target_means = np.random.normal(
        user_sample_mean, user_standard_error, num_users
    )
    # assign treatment groups
    treatment = np.random.choice([0, 1], num_users)

    # create target rate per session level
    target_percent_per_session = (
        treatment_effect * treatment[user_sessions]
        + user_target_means[user_sessions]
        + np.random.normal(0, 0.01, N)
    )

    # Clip session-level probabilities to the valid range [0, 1]
    target_percent_per_session = np.clip(target_percent_per_session, 0, 1)

    targets_observed = np.random.binomial(1, target_percent_per_session)

    # rename treatment array 0-->A, 1-->B
    mapped_treatment = np.where(treatment == 0, "A", "B")

    return pd.DataFrame(
        {
            "user": user_sessions,
            "treatment": mapped_treatment[user_sessions],
            "target": targets_observed,
            "scale": np.ones_like(user_sessions),
        }
    )

In [3]:
def compare_vanilla_cuped(N, num_users=2000, user_sample_mean=0.3, user_standard_error=0.5, treatment_effect=0.1):
    #data generation
    pre_data = generate_ratio_metric_data(N, num_users, user_sample_mean, user_standard_error, treatment_effect=0)
    post_data = generate_ratio_metric_data(N, num_users, user_sample_mean, user_standard_error, treatment_effect)
    

    pre_data['date']= pd.to_datetime('2022-01-01')
    post_data['date']= pd.to_datetime('2022-01-02')

    data = pd.concat([pre_data, post_data])
    analysis_vanilla = DeltaMethodAnalysis(
        cluster_cols=["user"]
    )
    vanilla_pval = analysis_vanilla.get_pvalue(post_data)

    analysis_cuped = DeltaMethodAnalysis(
        cluster_cols=["user"], cuped_time_split=("date", post_data['date'].min())
    )
    cuped_pval = analysis_cuped.get_pvalue(data)

    # naive test: per-user metric variance divided by the number of sessions, not users
    post_data= post_data.groupby(['user', 'treatment']).agg({'target':'sum', 'scale':'sum'}).reset_index()
    post_data['metric']  = post_data['target']/post_data['scale']
    stats = post_data.groupby('treatment').agg(
        mean = ('metric', 'mean'),
        var = ('metric', 'var'),
        num_samples = ('scale', 'sum')
        )

    mean_dif = stats.loc['A', 'mean'] - stats.loc['B', 'mean']
    var_dif = stats.loc['A', 'var']/stats.loc['A', 'num_samples'] + stats.loc['B', 'var']/stats.loc['B', 'num_samples']
    z_score = mean_dif/np.sqrt(var_dif)
    naive_pval = 2 * (1 - norm.cdf(abs(z_score)))

    return vanilla_pval, cuped_pval, naive_pval

### Comparison

Here we compare the naive t-test against the Delta Method, with and without CUPED.

In [4]:
# Simulate session-level data clustered by user
N = 200_000

vanilla_p_values = []
cuped_p_values = []
naive_p_values = []
for _ in range(100):
    vanilla, cuped, naive = compare_vanilla_cuped(N, num_users = 5_000, treatment_effect= 0.05)
    vanilla_p_values.append(vanilla)
    cuped_p_values.append(cuped)
    naive_p_values.append(naive)

print(f"Naive p-value: {np.mean(naive_p_values)}")
print(f"Vanilla p-value: {np.mean(vanilla_p_values)}")
print(f"CUPED p-value: {np.mean(cuped_p_values)}")

Naive p-value: 3.992187913581802e-11
Vanilla p-value: 0.020746084288781594
CUPED p-value: 0.019287884190794198


The naive test uses the number of sessions, not the number of users, as its sample size. This underestimates the variance of the metric and increases the false positive rate.
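
To get a feel for the size of this effect, the sketch below compares the standard error obtained when the per-user variance is divided by the number of users with the one obtained when it is divided by the number of sessions. It is a simplified illustration: the helper `standard_errors` is hypothetical, and it assumes every user has exactly 40 sessions, which is only the average implied by 200,000 sessions over 5,000 users.

```python
import numpy as np


def standard_errors(per_user_metric, sessions_per_user):
    """Return (SE using the user count, SE using the session count)."""
    metric = np.asarray(per_user_metric, dtype=float)
    sessions = np.asarray(sessions_per_user)
    variance = metric.var(ddof=1)
    se_by_users = np.sqrt(variance / len(metric))
    se_by_sessions = np.sqrt(variance / sessions.sum())
    return se_by_users, se_by_sessions


# Roughly one treatment arm: 2,500 users with 40 sessions each (assumed constant)
rng = np.random.default_rng(0)
metric = rng.normal(0.3, 0.15, size=2_500)
sessions = np.full(2_500, 40)

se_users, se_sessions = standard_errors(metric, sessions)
shrink_factor = se_users / se_sessions
print(f"standard error is understated by a factor of {shrink_factor:.2f}")
```

Under these assumptions the factor is the square root of the number of sessions per user, so the z-score is overstated by the same amount.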

### A-A test for validation

In [5]:
# Simulate session-level data clustered by user, this time with no treatment effect
N = 200_000

vanilla_p_values = []
cuped_p_values = []
naive_p_values = []
for _ in range(1000):
    vanilla, cuped, naive = compare_vanilla_cuped(N, num_users = 5_000, treatment_effect= 0)
    vanilla_p_values.append(vanilla)
    cuped_p_values.append(cuped)
    naive_p_values.append(naive)

print(f"Naive p-value: {np.mean(naive_p_values)}")
print(f"Vanilla p-value: {np.mean(vanilla_p_values)}")
print(f"CUPED p-value: {np.mean(cuped_p_values)}")

Naive p-value: 0.10592464218673368
Vanilla p-value: 0.500648753224461
CUPED p-value: 0.5036899545199222


In [6]:
positives_vanilla = sum([pval< 0.05 for pval in vanilla_p_values])
positives_cuped = sum([pval< 0.05 for pval in cuped_p_values])
positives_ttest = sum([pval< 0.05 for pval in naive_p_values])

print(f"Naive false positives rate: {positives_ttest/len(naive_p_values)}")
print(f"Vanilla false positives rate: {positives_vanilla/len(vanilla_p_values)}")
print(f"CUPED false positives rate: {positives_cuped/len(cuped_p_values)}")

Naive false positives rate: 0.748
Vanilla false positives rate: 0.061
CUPED false positives rate: 0.054


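The A/A results confirm this: without the Delta Method, the false positive rate rises to 0.748, far above the nominal 0.05, because the naive test underestimates the metric's variance. The Delta Method stays close to 0.05, both without CUPED (0.061) and with it (0.054).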
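
One way to judge whether an observed false positive rate is compatible with the nominal level is to express the gap in binomial standard errors. The sketch below does this for the three rates reported above, assuming 1,000 independent simulations and a 0.05 threshold; the helper `false_positive_z_score` is hypothetical and not part of cluster_experiments.

```python
import numpy as np


def false_positive_z_score(observed_rate, n_simulations, alpha=0.05):
    """Distance between observed and nominal rates, in binomial standard errors."""
    standard_error = np.sqrt(alpha * (1 - alpha) / n_simulations)
    return (observed_rate - alpha) / standard_error


# Rates taken from the A/A output above
for label, rate in [("Naive", 0.748), ("Vanilla", 0.061), ("CUPED", 0.054)]:
    z = false_positive_z_score(rate, n_simulations=1000)
    print(f"{label}: {z:.2f} standard errors from the nominal 5%")
```

Values within about two standard errors are consistent with a correctly calibrated test, while the naive rate lies far outside that range.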