This notebook shows that when treatment is assigned with a clustered splitter and the clusters explain part of the variance of the outcome, a non-clustered analysis produces a higher false positive rate than the nominal level.

In particular, we use a clustered splitter with two different analyses:
* An OLS estimator with cluster-robust standard errors, which passes the AA test.
* A simple t-test, which fails the AA test because its false positive rate is far above the nominal level.
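
Before turning to the library, the effect can be reproduced with a minimal standalone simulation. The sketch below uses only NumPy and SciPy and is not how cluster_experiments implements anything: the cluster count, cluster size, and random-intercept model are illustrative assumptions, and the cluster-aware analysis here is a t-test on cluster means rather than clustered OLS.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_clusters, n_per_cluster, n_sims, alpha = 20, 50, 500, 0.05  # illustrative values

naive_rejections = 0
cluster_rejections = 0
for _ in range(n_sims):
    # Each cluster has its own random intercept, so clusters explain part of the variance
    cluster_effect = rng.normal(size=n_clusters)
    y = cluster_effect[:, None] + rng.normal(size=(n_clusters, n_per_cluster))

    # Randomise whole clusters (half treated); there is no true effect, so this is an AA test
    treated = rng.permutation(n_clusters) < n_clusters // 2

    # Naive analysis: Welch t-test that treats every row as independent
    p_naive = ttest_ind(y[treated].ravel(), y[~treated].ravel(), equal_var=False).pvalue
    # Cluster-aware analysis: Welch t-test on one mean per cluster
    p_cluster = ttest_ind(
        y[treated].mean(axis=1), y[~treated].mean(axis=1), equal_var=False
    ).pvalue

    naive_rejections += int(p_naive < alpha)
    cluster_rejections += int(p_cluster < alpha)

naive_fpr = naive_rejections / n_sims
cluster_fpr = cluster_rejections / n_sims
print(f"Row-level t-test false positive rate:     {naive_fpr:.2f}")
print(f"Cluster-level t-test false positive rate: {cluster_fpr:.2f}")
```

The row-level test rejects far more often than `alpha`, while the cluster-level test stays near it. The rest of the notebook shows the same contrast using the library's splitter and analysis classes.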

In [1]:
from datetime import date

import numpy as np
import pandas as pd
from cluster_experiments import PowerAnalysis, UniformPerturbator, BalancedClusteredSplitter, ExperimentAnalysis
from scipy.stats import ttest_ind


# Create fake data
N = 10_000
clusters = [f"Cluster {i}" for i in range(10)]
dates = [f"{date(2022, 1, i):%Y-%m-%d}" for i in range(1, 15)]
df = pd.DataFrame(
    {
        "cluster": np.random.choice(clusters, size=N),
        "date": np.random.choice(dates, size=N),
    }
).assign(
    cluster_id=lambda df: df["cluster"].astype("category").cat.codes,
    day_of_week=lambda df: pd.to_datetime(df["date"]).dt.dayofweek,
    # Target is a linear combination of cluster and day of week, plus some noise
    target=lambda df: df["cluster_id"] + df["day_of_week"] + np.random.normal(size=N),
)


In [2]:
df.head()

,cluster,date,cluster_id,day_of_week,target
0,Cluster 6,2022-01-10,6,0,6.771333
1,Cluster 1,2022-01-13,1,3,4.265339
2,Cluster 6,2022-01-12,6,2,7.634346
3,Cluster 0,2022-01-11,0,1,0.390531
4,Cluster 4,2022-01-08,4,5,9.441262


Some clusters have a higher average outcome than others, so the cluster explains part of the variance of the target:

In [3]:
df.groupby("cluster").agg({"target": ["mean", "std"]})

cluster,target mean,target std
Cluster 0,3.027824,2.190903
Cluster 1,4.10805,2.237088
Cluster 2,5.052269,2.236289
Cluster 3,5.950557,2.219626
Cluster 4,7.041812,2.259418
Cluster 5,7.969491,2.223511
Cluster 6,9.094057,2.257794
Cluster 7,9.983032,2.240358
Cluster 8,10.960362,2.224778
Cluster 9,11.982331,2.254029
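
To put a number on how much of the variance the randomisation units explain, the following sketch rebuilds the same kind of synthetic data and computes the share of target variance that lies between (cluster, date) groups. The seeded generator `rng` and the name `between_share` are introduced here for illustration; the notebook itself uses unseeded `np.random`, so exact values will differ.

```python
from datetime import date

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded only so this sketch is reproducible
N = 10_000
clusters = [f"Cluster {i}" for i in range(10)]
dates = [f"{date(2022, 1, i):%Y-%m-%d}" for i in range(1, 15)]
df = pd.DataFrame(
    {
        "cluster": rng.choice(clusters, size=N),
        "date": rng.choice(dates, size=N),
    }
).assign(
    cluster_id=lambda d: d["cluster"].astype("category").cat.codes,
    day_of_week=lambda d: pd.to_datetime(d["date"]).dt.dayofweek,
    target=lambda d: d["cluster_id"] + d["day_of_week"] + rng.normal(size=N),
)

# Replace each row's target by the mean of its (cluster, date) group,
# then compare the variance of those group means with the total variance
group_mean = df.groupby(["cluster", "date"])["target"].transform("mean")
between_share = group_mean.var(ddof=0) / df["target"].var(ddof=0)
print(f"Share of variance between (cluster, date) groups: {between_share:.2f}")
```

A share close to 1 means that rows within the same (cluster, date) group are strongly correlated, which is exactly the situation in which a row-level test understates the uncertainty.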


In [4]:
# Simple (Welch) t-test analysis that ignores the cluster structure
class TTestAnalysis(ExperimentAnalysis):
    def analysis_pvalue(self, df: pd.DataFrame, verbose: bool = True) -> float:
        treatment_data = df.query(f"{self.treatment_col} == 1")[self.target_col]
        control_data = df.query(f"{self.treatment_col} == 0")[self.target_col]
        t_test_results = ttest_ind(treatment_data, control_data, equal_var=False)
        return t_test_results.pvalue

In [5]:
cluster_cols = ["cluster", "date"]

# Correct power analysis: clustered splitter with a clustered OLS analysis
pw_right = PowerAnalysis.from_dict(
    {
        "splitter": "clustered_balance",
        "perturbator": "uniform",
        "analysis": "ols_clustered",
        "cluster_cols": cluster_cols,
        "alpha": 0.05,
    },
)

# Incorrect power analysis: clustered splitter with a regular t-test
pw_wrong = PowerAnalysis(
    splitter=BalancedClusteredSplitter(
        cluster_cols=cluster_cols,
    ),
    perturbator=UniformPerturbator(),
    analysis=TTestAnalysis(
        # cluster_cols is passed here, but TTestAnalysis never uses it
        cluster_cols=cluster_cols,
    ),
    alpha=0.05,
)



Right way of doing it: in an AA test (average_effect=0.0) the measured power is the false positive rate, so it should not exceed the type I error rate of the test (alpha = 0.05).

In [6]:
pw_right.power_analysis(df, average_effect=0.0)

0.01

Wrong way of doing it: the AA test fails. With no true effect, the t-test rejects far more often than alpha = 0.05, so its false positive rate is badly inflated.

In [7]:
pw_wrong.power_analysis(df, average_effect=0.0)

0.84
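
The size of this inflation is roughly what the Kish design effect, 1 + (m - 1) * ICC, predicts. The sketch below is a back-of-the-envelope check and not part of cluster_experiments. It assumes equal-sized (cluster, date) groups, a normal approximation for the t-statistic, and the variances implied by the data-generating process above (cluster_id uniform on 0..9, day_of_week uniform on 0..6, unit-variance noise).

```python
from scipy.stats import norm

n_rows = 10_000
n_groups = 10 * 14                 # 10 clusters x 14 dates are the randomisation units
m = n_rows / n_groups              # average rows per group (assumed equal-sized)

var_cluster = (10**2 - 1) / 12     # variance of a discrete uniform on 0..9
var_dow = (7**2 - 1) / 12          # variance of a discrete uniform on 0..6
var_noise = 1.0                    # standard normal noise

# Intra-cluster correlation: share of variance shared by rows of the same group
icc = (var_cluster + var_dow) / (var_cluster + var_dow + var_noise)

# Kish design effect: factor by which the naive variance of the mean is understated
design_effect = 1 + (m - 1) * icc

# The naive t-statistic is inflated by sqrt(design_effect), so the test rejects
# whenever a standard normal exceeds z_crit / sqrt(design_effect) in absolute value
alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)
expected_fpr = 2 * norm.sf(z_crit / design_effect**0.5)

print(f"ICC: {icc:.3f}, design effect: {design_effect:.1f}")
print(f"Approximate false positive rate of the naive t-test: {expected_fpr:.2f}")
```

Under these assumptions the predicted rejection rate is about 0.8, in line with the 0.84 observed above.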