# Contents
- [Data Setup](#Data-Setup)
- [Randomization Methods](#Randomization-Methods)
- [AA Testing](#AA-Testing)
    - [SRM Check](#SRM-Check)
- [Power Analysis](#Power-Analysis)
- [AB Testing](#AB-Testing)
    - [Definitions](#Definitions)
    - [Results Summary](#Results-Summary)
    - [Test](#Test)
- [How Long](#How-Long)
- [Results](#Results)
    - [Summary](#Conversion-Rates)
    - [Visualization](#Visualization)
    - [Confidence Intervals Within Groups](#Confidence-Intervals-Outcomes)
    - [Confidence Intervals Across Groups](#Confidence-Intervals-Difference)
    - [Conclusion](#Conclusion)
- [Post Hoc Analysis](#Post-Hoc-Analysis)
    - [Segmented Lift](#Segmented-Lift)
    - [Guardrail Metrics](#Gaurdrail-Metrics)
    - [Rollout Simulation](#Rollout-Simulation)
- [Other Notes](#Other-Notes)
    - [Experimentation-Infrastructure](#Experimentation-Infrastructure)
- [Scratch Notes](#Scratch-Notes)
    - [Other Use Cases](#Other-Use-Cases)

<details>
<summary><strong>🧪 A/B Test - Decision Flow (click to expand)</strong></summary>

```
[What is your outcome type?]  
   |  
   +--> Binary (e.g., converted = 1 or 0, clicked or not)  
   |     |  
   |     +--> What are you comparing?  
   |           |  
   |           +--> Proportions (e.g., 10% vs 12% conversion rate)  
   |           |     |  
   |           |     +--> Comparing 2 groups ---------> Use Z-test  
   |           |     |                                 Compares success rates (proportions) between 2 groups  
   |           |     +--> Comparing 3+ groups --------> Use Chi-Square Test  
   |           |                                       Tests whether at least one group’s success rate differs; follow with pairwise Z-tests  
   |           +--> Counts (e.g., number of users who converted)  
   |                 +--> Comparing 2 groups ---------> Use Chi-Square Test  
   |                 +--> Comparing 3+ groups --------> Use Chi-Square Test  
   |  
   +--> Continuous (e.g., revenue, time spent, items bought)  
   |     +--> Comparing 2 groups  
   |     |     +--> Are the groups made of different users?  
   |     |           +--> Yes  
   |     |           |     +--> Is the outcome roughly normal?  
   |     |           |           +--> Yes ------------> Use Independent T-test  
   |     |           |           +--> No -------------> Use Mann-Whitney U Test  
   |     |           +--> No (same users before/after)  
   |     |                 +--> Is the outcome roughly normal?  
   |     |                       +--> Yes ------------> Use Paired T-test  
   |     |                       +--> No -------------> Use Wilcoxon Signed-Rank Test  
   |     +--> Comparing 3+ groups --------------------> Use ANOVA  
   |  
   +--> Categorical (e.g., selected A/B/C option)  
         +--> Comparing 2 or more groups -------------> Use Chi-Square Test  

[Other Scenarios]  
   +--> Want to control for other variables? ---------> Use Regression (Linear or Logistic)  
   +--> Prefer probability over p-values? ------------> Use Bayesian A/B Testing  
```

</details>


<details> <summary><strong>📊 A/B Test - Decision Flow Flattened Table (click to expand) </strong></summary>
    
| Outcome Type | What Are You Comparing?          | Group Count | Group Structure | Outcome Distribution | Statistical Test        | What It Does                                             | Example Problem Statement |
|--------------|----------------------------------|-------------|------------------|-----------------------|--------------------------|----------------------------------------------------------|----------------------------|
| Binary        | Proportions (% converted)        | 2           | Independent      | N/A                   | Z-test                   | Compares proportions between 2 groups                    | Does the new homepage increase conversion from 10% to 12%? |
| Binary        | Proportions                      | 3+          | Independent      | N/A                   | Chi-Square               | Tests if at least one group’s conversion rate differs    | Is there a significant difference in conversion across blue/orange/green CTA? |
| Binary        | Counts (e.g., #converted users)  | 2           | Independent      | N/A                   | Chi-Square               | Compares success/failure counts between groups           | Did 120 out of 1000 in group A convert vs 150 out of 1000 in group B? |
| Binary        | Counts                           | 3+          | Independent      | N/A                   | Chi-Square               | Compares categorical counts across multiple groups       | Do different signup flows lead to different conversion counts? |
| Continuous    | Mean of a metric (e.g., revenue) | 2           | Independent      | Normal                | Independent T-test       | Compares average outcome across 2 independent groups     | Does average order value differ between control and treatment? |
| Continuous    | Mean of a metric                 | 2           | Independent      | Not normal            | Mann-Whitney U           | Compares ranks/distributions between 2 independent groups| Is time-on-site higher in treatment group (skewed data)? |
| Continuous    | Before vs After (same users)     | 2           | Paired           | Normal                | Paired T-test            | Compares mean change for same users before and after     | Did users spend more on their second visit after UI update? |
| Continuous    | Before vs After (same users)     | 2           | Paired           | Not normal            | Wilcoxon Signed-Rank     | Compares paired non-normal outcomes                      | Did session duration increase for the same users post-change? |
| Continuous    | Mean outcome                     | 3+          | Independent      | Any                   | ANOVA                    | Compares means across 3 or more groups                   | Does average basket size differ across A/B/C pricing variants? |
| Categorical   | User-selected categories         | 2+          | Independent      | N/A                   | Chi-Square               | Tests distribution of categories across groups           | Do users pick different plans (Basic, Pro, Premium) across test groups? |
| Any           | Adjusting for other variables    | Any         | Any              | N/A                   | Regression (Linear/Logistic) | Measures treatment effect while controlling for covariates | Is treatment still effective after accounting for device and region? |
| Any           | Prefer probability > p-value     | Any         | Any              | N/A                   | Bayesian A/B Test        | Returns probability one group is better than the other   | What’s the probability green button outperforms blue? |
| Binary (Paired) | Conversion before vs after (same users) | 2     | Paired           | N/A                   | McNemar’s Test           | Tests change in conversion for same users                | Did logged-in users convert more after design change? |  
</details>

<details> <summary><strong>📊 When to Use Which Statistical Test in A/B Testing </strong></summary>

#### 🧪 When to Use Which Statistical Test in A/B Testing

| **Metric Type**        | **Example**                        | **Recommended Test**                      | **Why**                                                  |
|------------------------|------------------------------------|-------------------------------------------|-----------------------------------------------------------|
| Continuous             | Revenue, time on site, scores      | `scipy.stats.ttest_ind` (T-test)          | Compares means of two independent groups                 |
| Continuous (unequal variance) | Same as above               | `ttest_ind(..., equal_var=False)`         | Welch’s T-test — safer when variances differ             |
| Binary (0/1 outcomes)  | Conversion, click, purchase        | `statsmodels.stats.proportion.proportions_ztest` | Compares proportions between two groups           |
| Count data             | # pageviews, # items bought        | Poisson or Negative Binomial test         | For skewed count distributions                           |
| Non-parametric         | Ordinal/skewed data, NPS scores    | Mann-Whitney U test                       | No assumption of normality                               |
| Multiple groups (A/B/C)| Multi-variant tests                | ANOVA (continuous), Chi-squared (binary)  | Tests across 3+ groups                                   |

---

#### ✅ Quick Rules of Thumb:
- If your metric is **continuous + normal-ish** → Use **T-test**
- If it’s **binary (e.g., clicked or not)** → Use **Z-test**
- If it’s **non-normal or skewed** → Use **Mann-Whitney U test**
- If testing **3 or more variants** → Use **ANOVA** or **Chi-squared**

</details>
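
As a rough illustration of the binary case, the sketch below runs a two-proportion z-test on the counts from the flattened table's example (120 of 1000 vs 150 of 1000) and compares it with a chi-square test on the same 2x2 table. The variable names are illustrative. With the continuity correction turned off, the chi-square statistic should equal the squared z statistic, which is why the two tests give the same answer when there are only two groups.

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

# Example counts (illustrative): conversions and sample sizes per group
conversions = np.array([120, 150])
sample_sizes = np.array([1000, 1000])

# Two-sided z-test for the difference in proportions (pooled variance)
z_stat, z_p = proportions_ztest(count=conversions, nobs=sample_sizes)

# Same data as a 2x2 contingency table: [converted, not converted] per group
table = np.column_stack([conversions, sample_sizes - conversions])
chi2_stat, chi2_p, dof, expected = chi2_contingency(table, correction=False)

print(f"z = {z_stat:.3f}, p = {z_p:.4f}")
print(f"chi2 = {chi2_stat:.3f}, p = {chi2_p:.4f}")
```

With three or more groups only the chi-square test applies directly, which matches the decision flow above.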

# Data Setup

In [1]:
# Display all outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from scipy.stats import ttest_ind, chi2_contingency, ttest_rel, mannwhitneyu
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from statsmodels.stats.power import TTestIndPower, TTestPower, FTestAnovaPower

# Set Seed 
my_seed=42
np.random.seed(my_seed)

##### Test Type

In [2]:
experiment_type = 'binary'  # options: 'binary', 'continuous_independent', 'continuous_paired', 'anova'
# Options:
#   'binary'                 → For conversion rate comparisons (Z-test)
#   'continuous_independent' → Comparing means across different users (T-test)
#   'continuous_paired'      → Comparing means for same users (Paired T-test)
#   'anova'                  → Comparing means across 3+ groups (ANOVA)

##### Sample user data

In [3]:
observations_count = 100 

np.random.seed(my_seed)  # For reproducibility
users = pd.DataFrame({
    'user_id': range(1, observations_count+1),
    'platform': np.random.choice(['iOS', 'Android'], size=observations_count, p=[0.6, 0.4]),  # 60% iOS, 40% Android
    'engagement_score': np.random.normal(50, 15, observations_count),  # Simulated user engagement scores
    'city': np.random.choice(['ny', 'sf', 'chicago', 'austin'], size=observations_count),
    'past_purchase_count': np.random.normal(loc=50, scale=10, size=observations_count)  # pre-experiment metric used as the CUPED covariate
})
users

Unnamed: 0,user_id,platform,engagement_score,city,past_purchase_count
0,1,iOS,51.305706,sf,46.847308
1,2,Android,45.514890,ny,57.589692
2,3,Android,51.376412,sf,42.271748
3,4,iOS,20.186466,ny,47.631814
4,5,iOS,46.704922,austin,45.146365
...,...,...,...,...,...
95,96,iOS,37.762846,austin,47.978073
96,97,iOS,48.843474,austin,47.823188
97,98,iOS,55.117280,sf,60.987769
98,99,iOS,54.150362,austin,58.254163


[Back to the top](#Contents)
___


# Randomization Methods

Randomization is used to ensure that observed differences in outcome metrics are due to the experiment, not pre-existing differences.

- Prevents **selection bias** (e.g., users self-selecting into groups)  
- Balances **confounding factors** like platform, region, or past behavior  
- Enables **valid inference** through statistical testing
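
One way to check the "balances confounding factors" claim on real data is to compare a pre-experiment covariate across groups after assignment. The sketch below is a minimal example on simulated data; `standardized_mean_difference` is a hypothetical helper, not part of this notebook, and the 0.1 threshold mentioned in the comment is a common rule of thumb rather than a hard rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical user table with one pre-experiment covariate
n = 10_000
df = pd.DataFrame({
    "engagement_score": rng.normal(50, 15, n),
    "group": rng.choice(["control", "treatment"], size=n),
})

def standardized_mean_difference(df, covariate, group_col="group"):
    """Difference in group means divided by the pooled standard deviation."""
    control = df.loc[df[group_col] == "control", covariate]
    treatment = df.loc[df[group_col] == "treatment", covariate]
    pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
    return (treatment.mean() - control.mean()) / pooled_sd

# Values well below roughly 0.1 in absolute terms suggest the covariate is balanced
smd = standardized_mean_difference(df, "engagement_score")
print(f"SMD for engagement_score: {smd:.4f}")
```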

### 🎯 Simple Randomization  
Each user is assigned to control or treatment with **equal probability**, independent of any characteristics.

---
##### When to Use
- Sample size is **large enough** to ensure natural balance  
- No strong concern about **confounding variables**  
- Need a **quick, default assignment** strategy

##### How It Works
- Assign each user randomly (e.g., 50/50 split)  
- No grouping, segmentation, or blocking involved  
- Groups are expected to balance out on average  

In [4]:
def apply_simple_randomization(df, group_col='group', seed=my_seed):
    """
    Randomly assigns each row to 'control' or 'treatment' with equal probability.

    Parameters:
    - df: pandas DataFrame containing observations
    - group_col: name of the column to store group assignments
    - seed: random seed for reproducibility

    Returns:
    - DataFrame with an added group assignment column
    """
    np.random.seed(seed)
    df[group_col] = np.random.choice(['control', 'treatment'], size=len(df), p=[0.5, 0.5])
    return df

### Stratified Sampling  
Ensures that key segments (e.g., platform, region) are evenly represented across control and treatment.

---

##### When to Use
- User base is **naturally skewed** (e.g., 70% mobile, 30% desktop)  
- Important to control for **known confounders** like geography or device  
- You want balance **within subgroups**, not just overall

##### How It Works
- Pick a stratification variable (e.g., platform)  
- Split population into strata (groups)  
- Randomly assign users **within each stratum**  

In [5]:
def apply_stratified_randomization(df, stratify_col, group_col='group', seed=my_seed):
    """
    Performs stratified randomization to ensure balanced assignment across a key segment.

    Parameters:
    - df: pandas DataFrame to assign groups to
    - stratify_col: column to balance across (e.g., platform, region)
    - group_col: name of the column to store group assignments
    - seed: random seed for reproducibility

    Returns:
    - DataFrame with assigned groups, balanced across stratification column
    """
    # Split into two halves while preserving the distribution of the stratify column
    control, treatment = train_test_split(
        df, test_size=0.5, stratify=df[stratify_col], random_state=seed
    )

    # Label each half; assign() returns copies, which avoids SettingWithCopyWarning
    control = control.assign(**{group_col: 'control'})
    treatment = treatment.assign(**{group_col: 'treatment'})

    # Combine and restore the original row order (by index)
    return pd.concat([control, treatment]).sort_index()
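
A quick way to confirm that a stratified split did its job is to look at how each stratum was divided between the groups. The sketch below is self-contained and uses a simulated 70/30 mobile/desktop population (an assumption for illustration); it performs the same kind of `train_test_split` stratification and then tabulates the result.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical skewed population: about 70% mobile, 30% desktop
df = pd.DataFrame({
    "user_id": np.arange(1, 1001),
    "platform": rng.choice(["mobile", "desktop"], size=1000, p=[0.7, 0.3]),
})

# Stratified 50/50 split on platform
control, treatment = train_test_split(
    df, test_size=0.5, stratify=df["platform"], random_state=42
)
assigned = pd.concat([
    control.assign(group="control"),
    treatment.assign(group="treatment"),
]).sort_index()

# Share of each platform that landed in each group (each row sums to 1)
balance = pd.crosstab(assigned["platform"], assigned["group"], normalize="index")
print(balance)
```

Each platform row should be close to 0.5 / 0.5, regardless of how skewed the platform mix is overall.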

### Block Randomization  
Groups users into fixed-size blocks and randomly assigns groups within each block.

---

##### When to Use
- Users arrive in **time-based batches** (e.g., daily cohorts)  
- Sample size is **small** and needs enforced balance  
- You want to minimize **temporal or ordering effects**

##### How It Works
- Create blocks based on order or ID (e.g., every 10 users)  
- Randomize assignments **within each block**  
- Ensures near-equal split in every batch  

In [6]:
def apply_block_randomization(
    df,
    observation_id_col,
    group_col='group',
    block_size=10,
    seed=my_seed
):
    """
    Assigns treatment/control groups using block randomization.

    Parameters:
    - df: DataFrame to modify
    - observation_id_col: Unique, 1-based sequential identifier (e.g., user_id, session_id)
    - group_col: Column to assign groups into
    - block_size: Number of observations per block (use an even number for an exact 50/50 split)
    - seed: Random seed for reproducibility

    Returns:
    - DataFrame with assigned group_col
    """
    np.random.seed(seed)

    # Block 0 holds ids 1..block_size, block 1 the next block_size ids, and so on
    blocks = (df[observation_id_col] - 1) // block_size

    # Shuffle a half-control, half-treatment label list inside each block so every
    # block is balanced (independent draws per user would not enforce this)
    assignments = pd.Series(index=df.index, dtype=object)
    for _, idx in df.groupby(blocks).groups.items():
        labels = np.resize(['control', 'treatment'], len(idx))
        assignments.loc[idx] = np.random.permutation(labels)

    df[group_col] = assignments
    return df
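
One way to get the enforced balance described above is to shuffle a fixed half-and-half label list inside each block, then confirm the split block by block. The sketch below is a minimal, self-contained illustration; `assign_within_blocks` is a hypothetical helper and the IDs are assumed to be sequential starting at 1.

```python
import numpy as np
import pandas as pd

def assign_within_blocks(ids, block_size=10, seed=42):
    """Shuffle a half-control, half-treatment label list inside each block."""
    rng = np.random.default_rng(seed)
    ids = pd.Series(ids).reset_index(drop=True)
    blocks = (ids - 1) // block_size           # block 0 holds ids 1..block_size
    labels = pd.Series(index=ids.index, dtype=object)
    for _, idx in ids.groupby(blocks).groups.items():
        balanced = np.resize(["control", "treatment"], len(idx))
        labels.loc[idx] = rng.permutation(balanced)
    return pd.DataFrame({"user_id": ids, "block": blocks, "group": labels})

assigned = assign_within_blocks(range(1, 101), block_size=10)

# Count of each group per block; every full block of 10 holds 5 of each
per_block = pd.crosstab(assigned["block"], assigned["group"])
print(per_block)
```

Drawing each user's label independently would balance the groups only on average, so individual blocks could end up lopsided.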

### Matched-Pair Randomization

Participants are **paired based on similar characteristics** before random group assignment.  
This reduces variance and improves **statistical power** by ensuring balance on key covariates.

---

##### When to Use
- Small sample size with high risk of **confounding**
- Outcomes influenced by user traits (e.g., **age, income, tenure**)  
- Need to **minimize variance** across groups

##### How It Works
1. Identify important covariates (e.g., age, purchase history)  
2. Sort users by those variables  
3. Create matched pairs (or small groups)  
4. Randomly assign one to **control**, the other to **treatment**  

In [7]:
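def apply_matched_pair_randomization(df, sort_col, group_col='group', seed=my_seed):
    """
    Assigns groups using matched-pair randomization based on a sorting variable.

    Parameters:
    - df: pandas DataFrame to assign groups to
    - sort_col: column used to sort users before pairing (e.g., engagement score)
    - group_col: name of the column to store group assignments
    - seed: random seed for reproducibility

    Returns:
    - DataFrame sorted by sort_col, with one member of each adjacent pair randomly
      assigned to control and the other to treatment
    """
    np.random.seed(seed)

    # Sort by the matching variable so adjacent rows form pairs of similar observations
    df = df.sort_values(by=sort_col).reset_index(drop=True)

    # Flip one coin per pair to decide which member gets control
    n_pairs = int(np.ceil(len(df) / 2))
    first_is_control = np.random.rand(n_pairs) < 0.5
    first = np.where(first_is_control, 'control', 'treatment')
    second = np.where(first_is_control, 'treatment', 'control')

    # Interleave the labels pair by pair; trim the last label if the row count is odd
    df[group_col] = np.column_stack([first, second]).ravel()[:len(df)]

    return df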

### Cluster Randomization
Entire **groups or clusters** (e.g., cities, stores, schools) are assigned to control or treatment.  
Used when it's impractical or risky to randomize individuals within a cluster.

---

##### When to Use
- Users naturally exist in **groups** (e.g., teams, locations, devices)
- There's a risk of **interference** between users (e.g., word-of-mouth)
- Operational or tech constraints prevent individual-level randomization

##### How It Works
1. Define the cluster unit (e.g., store, city)  
2. Randomly assign each cluster to control or treatment  
3. All users within the cluster inherit the group assignment  

In [8]:
def apply_cluster_randomization(df, cluster_col, group_col='group', seed=my_seed):
    """
    Assigns groups using cluster-level randomization, where all observations in a cluster
    receive the same assignment.

    Parameters:
    - df: pandas DataFrame to assign groups to
    - cluster_col: column defining the cluster (e.g., city, store)
    - group_col: name of the column to store group assignments
    - seed: random seed for reproducibility

    Returns:
    - DataFrame with group assignments applied at the cluster level
    """
    np.random.seed(seed)

    # Randomly assign each cluster to control or treatment
    unique_clusters = df[cluster_col].unique()
    cluster_assignments = dict(
        zip(unique_clusters, np.random.choice(['control', 'treatment'], size=len(unique_clusters)))
    )

    # Map cluster-level assignments back to the full dataset
    df[group_col] = df[cluster_col].map(cluster_assignments)

    return df

### CUPED (Controlled-experiment Using Pre-Experiment Data)
A statistical adjustment that uses **pre-experiment behavior** to reduce variance and improve power.  
It helps detect smaller effects without increasing sample size.

##### When to Use
- You have reliable **pre-experiment metrics** (e.g., past spend, engagement)
- You want to **reduce variance** and improve test sensitivity
- You’re dealing with **small lifts** or **costly sample sizes**

##### How It Works
1. Identify a pre-period metric **correlated with your outcome**
2. Use regression to compute an adjustment (theta)  
3. Subtract the correlated component from your outcome metric  
4. Analyze the adjusted metric instead of the raw one  
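
The steps above can be written compactly. With outcome $Y$ and pre-period metric $X$, the adjustment factor is $\theta = \mathrm{Cov}(Y, X) / \mathrm{Var}(X)$ and the adjusted outcome is $Y_{\text{cuped}} = Y - \theta\,(X - \bar{X})$; its variance is roughly $(1 - \rho^2)$ times the original, where $\rho$ is the correlation between $Y$ and $X$. The sketch below checks this on simulated data. The data-generating numbers are assumptions chosen only to make the outcome correlate with the pre-period metric.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Simulated data (assumption): outcome is correlated with a pre-period metric
pre_metric = rng.normal(50, 10, n)
outcome = 0.8 * pre_metric + rng.normal(0, 5, n)

# theta is the OLS slope of the outcome on the pre-period metric
theta = np.cov(outcome, pre_metric, ddof=1)[0, 1] / np.var(pre_metric, ddof=1)

# Subtract the centered, scaled covariate so the mean of the outcome is preserved
outcome_cuped = outcome - theta * (pre_metric - pre_metric.mean())

rho = np.corrcoef(outcome, pre_metric)[0, 1]
variance_ratio = np.var(outcome_cuped, ddof=1) / np.var(outcome, ddof=1)
print(f"theta = {theta:.3f}, correlation = {rho:.3f}")
print(f"variance ratio = {variance_ratio:.3f} (1 - rho^2 = {1 - rho**2:.3f})")
```

The stronger the correlation between the pre-period metric and the outcome, the larger the variance reduction; an uncorrelated covariate gives essentially no benefit.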

In [9]:
def apply_cuped(df, pre_metric, group_col='group', outcome_col='adjusted_outcome', seed=my_seed):
    """
    Applies CUPED (Controlled-experiment Using Pre-Experiment Data) adjustment to reduce variance
    using a pre-experiment covariate.

    Parameters:
    - df: pandas DataFrame to assign groups and compute adjusted outcomes
    - pre_metric: name of the pre-experiment covariate column
    - group_col: name of the column to store group assignments
    - outcome_col: name of the new column to store the CUPED-adjusted outcome
    - seed: random seed for reproducibility

    Returns:
    - DataFrame with random assignment and adjusted outcome using CUPED
    """
    np.random.seed(seed)

    # Perform simple randomization (required before applying CUPED)
    df = apply_simple_randomization(df.copy(), group_col=group_col, seed=seed)

    # Simulate experiment outcome (in practice, this would be your actual metric)
    y = np.random.normal(loc=0, scale=5, size=len(df))

    # Regress outcome on pre-experiment metric to estimate theta (correction factor)
    X = sm.add_constant(df[[pre_metric]])
    theta = sm.OLS(y, X).fit().params[pre_metric]

    # Apply CUPED adjustment: subtract theta * (centered pre-metric) so the
    # variance drops while the mean of the outcome is unchanged
    df[outcome_col] = y - theta * (df[pre_metric] - df[pre_metric].mean())

    return df

In [10]:
method_map = {
    "simple": lambda df: apply_simple_randomization(df),
    "stratified": lambda df: apply_stratified_randomization(df, stratify_col='platform'),
    "block": lambda df: apply_block_randomization(df, observation_id_col='user_id', block_size=10),
    "matched_pair": lambda df: apply_matched_pair_randomization(df, sort_col='engagement_score'),
    "cluster": lambda df: apply_cluster_randomization(df, cluster_col='city'),
    "cuped": lambda df: apply_cuped(df, pre_metric='past_purchase_count'),
}

randomization_method="simple"
if randomization_method not in method_map:
    raise ValueError(f"❌ Unsupported method: {randomization_method}")
    
users = method_map[randomization_method](users)
users

Unnamed: 0,user_id,platform,engagement_score,city,past_purchase_count,group
0,1,iOS,51.305706,sf,46.847308,control
1,2,Android,45.514890,ny,57.589692,treatment
2,3,Android,51.376412,sf,42.271748,treatment
3,4,iOS,20.186466,ny,47.631814,treatment
4,5,iOS,46.704922,austin,45.146365,control
...,...,...,...,...,...,...
95,96,iOS,37.762846,austin,47.978073,control
96,97,iOS,48.843474,austin,47.823188,treatment
97,98,iOS,55.117280,sf,60.987769,control
98,99,iOS,54.150362,austin,58.254163,control


[Back to the top](#Contents)
___


# AA Testing

A/A testing is an experiment where both groups (control and treatment) receive the **same experience**. It acts as a **sanity check** before running an A/B test: because there is no real difference between the groups, group sizes should match the planned split and key metrics should show no statistically significant difference.

##### Importance
- **Validates Randomization:** Ensures that groups are statistically similar before testing.
- **Detects Sample Ratio Mismatch (SRM):** Verifies that user assignment is balanced.
- **Estimates Variance for Power Analysis:** Helps understand variability before defining sample size for A/B testing.
- **Checks for Pre-Existing Bias:** Ensures no systematic differences exist between control and treatment groups.

##### Process
- **Randomly assign users** to two equal groups (just like an A/B test).
- **Measure key metrics** (e.g., conversion rate, engagement, revenue).
- **Perform statistical tests** (e.g., t-test for continuous data, chi-square for categorical data) to confirm no significant difference.
- **Analyze Sample Ratio Mismatch (SRM)** to verify even split between groups.

##### Result
- **No significant difference:** Randomization is working correctly, and the experiment is set up properly.
- **Significant difference detected:** Investigate potential issues such as randomization bugs, sample bias, or instrumentation errors.
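
A single A/A run can still come out "significant" by chance, so it helps to know what a healthy setup looks like over many runs. The simulation below is a minimal sketch under assumed parameters (normally distributed metric, 1,000 users per run, 2,000 runs): with a correct randomizer, the share of runs with p < 0.05 should sit near 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_simulations = 2_000
n_users = 1_000
alpha = 0.05

false_positives = 0
for _ in range(n_simulations):
    # Both groups are drawn from the same distribution (no true effect)
    scores = rng.normal(loc=50, scale=10, size=n_users)
    in_a1 = rng.random(n_users) < 0.5
    _, p_value = stats.ttest_ind(scores[in_a1], scores[~in_a1])
    false_positives += p_value < alpha

false_positive_rate = false_positives / n_simulations
print(f"False positive rate over {n_simulations} A/A runs: {false_positive_rate:.3f}")
```

A rate far above the chosen alpha would point to a problem in assignment, logging, or the test itself rather than to bad luck.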


In [None]:
np.random.seed(my_seed)

# Step 1: Simulate a base population of customers
n_customers = 2000
population = pd.DataFrame({
    'user_id': np.arange(1, n_customers + 1),
    'is_eligible': np.random.choice([0, 1], size=n_customers, p=[0.4, 0.6]),  # 60% eligible
    'engagement_score': np.random.normal(loc=50, scale=10, size=n_customers)  # some behavioral metric
})
print("population:\n")
population


In [None]:
# Step 2: Filter to eligible population
eligible_population = population[population['is_eligible'] == 1].copy()
n_eligible = len(eligible_population)


In [None]:
# Step 3: Randomly assign eligible users into two groups (A1 and A2)
eligible_population['group'] = np.random.choice(['A1', 'A2'], size=n_eligible, replace=True)
print("eligible_population:\n")
eligible_population


In [None]:
# Step 4: Split into the two groups
group_A1 = eligible_population[eligible_population['group'] == 'A1']['engagement_score']
group_A2 = eligible_population[eligible_population['group'] == 'A2']['engagement_score']
print("group_A1, group_A2:\n")
print(group_A1.head())
print(group_A2.head())

## SRM Check 
Is group assignment balanced?

- 🔍 SRM (Sample Ratio Mismatch) checks whether the observed group sizes match the expected ratio.
- In a perfect world, random assignment to 'A1' and 'A2' should give ~50/50 split.
- SRM helps catch bugs in randomization, data logging, or user eligibility filtering.

🎯 Real-World Experiment Split Ratios

| **Scenario**                     | **Split**              | **Why**                                 |
|----------------------------------|------------------------|------------------------------------------|
| Default A/B                      | 50 / 50                | Maximizes power and ensures fairness     |
| Risky feature                    | 10 / 90 or 20 / 80 (treatment / control) | Limits user exposure to minimize risk    |
| Ramp-up                          | Step-wise (1-5-25-50…) | Gradual rollout to catch issues early    |
| A/B/C Test                       | 33 / 33 / 33 or weighted | Compares variants evenly, or deliberately gives some arms more traffic |
| High control confidence needed   | 70 / 30 or 60 / 40 (control / treatment) | More stability in baseline comparisons   |
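
When the planned split is not 50/50, the SRM check needs the expected counts to follow the planned ratio. The sketch below shows one way to do that with a chi-square goodness-of-fit test; `srm_check` is a hypothetical helper and the observed counts are made up for illustration.

```python
import numpy as np
from scipy import stats

def srm_check(observed_counts, expected_ratios, alpha=0.05):
    """Chi-square goodness-of-fit test of observed group sizes vs. the planned split."""
    observed = np.asarray(observed_counts, dtype=float)
    ratios = np.asarray(expected_ratios, dtype=float)
    expected = observed.sum() * ratios / ratios.sum()
    chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
    return chi2_stat, p_value, p_value < alpha

# Planned 10 / 90 split; illustrative observed counts
chi2_ok, p_ok, srm_ok = srm_check([1_010, 8_990], [10, 90])
chi2_bad, p_bad, srm_bad = srm_check([1_150, 8_850], [10, 90])
print(f"close to plan : p = {p_ok:.4f}, SRM flagged = {srm_ok}")
print(f"off plan      : p = {p_bad:.2e}, SRM flagged = {srm_bad}")
```

The same helper works for three or more arms by passing one count and one ratio per arm.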


In [None]:
observed_counts = eligible_population['group'].value_counts().sort_index()
expected_counts = [n_eligible / 2, n_eligible / 2]

# Print observed counts and percentages
print("\n📊 Group Assignment Breakdown")
for group in observed_counts.index:
    count = observed_counts[group]
    pct = count / n_eligible * 100
    print(f"Group {group}: {count} users ({pct:.2f}%)")
    
# Chi-Square Goodness of Fit Test
chi2_stat, chi2_p = stats.chisquare(f_obs=observed_counts, f_exp=expected_counts)

print("\n🔍 SRM Check")
print(f"Chi2 Stat: {chi2_stat:.4f}")
print(f"P-value : {chi2_p:.4f}")
if chi2_p < 0.05:
    print("⚠️ Sample Ratio Mismatch detected — investigate assignment logic.")
else:
    print("✅ No SRM — group assignment is balanced.")

In [None]:
# Step 6: Run 2-sample (independent) t-test
t_stat, p_value = stats.ttest_ind(group_A1, group_A2)

print(f"\nT-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("⚠️ Statistically significant difference found — check randomization.")
else:
    print("✅ No significant difference — randomization looks good.")

# Step 7: Visualize distributions
plt.hist(group_A1, bins=30, alpha=0.5, label='Group A1');
plt.hist(group_A2, bins=30, alpha=0.5, label='Group A2');
plt.title('AA Test: Distribution Comparison');
plt.legend();
plt.show();

[Back to the top](#Contents)
___


# Power Analysis

Power analysis helps determine the **minimum sample size** needed to detect an expected effect with statistical confidence.

##### Why It Matters:
- Avoids **underpowered** tests (can't detect real differences)
- Avoids **overpowered** tests (wastes resources)
- Balances tradeoffs between **sample size**, **effect size**, **confidence level**, and **statistical power**

##### Key Inputs:
- **alpha (α):** Significance level (probability of Type I error, usually 0.05)
- **Power (1 - β):** Probability of detecting a true effect (commonly 0.80 or 0.90)
- **Baseline:** Current performance (e.g., 10% conversion rate)
- **Minimum Detectable Effect (MDE):** Smallest absolute lift you care to detect (e.g., +2 percentage points, from 10% to 12%)

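For binary outcomes, we use a two-sample z-test for proportions to estimate the sample size per group; the continuous and ANOVA cases rely on the power solvers in `statsmodels`.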
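
For the binary case, a common normal-approximation formula for the per-group sample size is

$$ n = \frac{(z_{1-\alpha/2} + z_{1-\beta})^2 \cdot 2\,p_1 (1 - p_1)}{\delta^2} $$

where $p_1$ is the baseline rate and $\delta$ is the MDE. This version uses the baseline variance for both groups; a slightly more conservative variant replaces $2\,p_1(1-p_1)$ with $p_1(1-p_1) + p_2(1-p_2)$, where $p_2 = p_1 + \delta$. The sketch below computes both for a 10% baseline and a 2-point MDE; `n_per_group` and its `use_both_variances` flag are illustrative names, not part of this notebook.

```python
import numpy as np
from scipy import stats

def n_per_group(p1, mde, alpha=0.05, power=0.80, use_both_variances=False):
    """Approximate per-group sample size for a two-sided two-proportion z-test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    p2 = p1 + mde
    if use_both_variances:
        variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    else:
        variance_sum = 2 * p1 * (1 - p1)   # baseline variance for both groups
    return int(np.ceil((z_alpha + z_beta) ** 2 * variance_sum / mde ** 2))

n_baseline_only = n_per_group(0.10, 0.02)
n_both = n_per_group(0.10, 0.02, use_both_variances=True)
print(f"baseline variance only : {n_baseline_only} per group")
print(f"both group variances   : {n_both} per group")
```

Because the variance of a proportion grows as the rate moves toward 50%, the second variant asks for more users here than the baseline-only version.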

In [11]:
def calculate_power_sample_size(
    experiment_type,
    alpha=0.05,
    power=0.80,
    baseline_rate=None,   # For binary: e.g., 0.10 = 10% conversion rate
    mde=None,             # Minimum Detectable Effect (absolute change)
                          # e.g., 0.02 for 2% lift in binary, or $5 increase in continuous
    std_dev=None,         # Std deviation of outcome for continuous outcomes
    effect_size=None,     # Cohen's d (for t-tests) or f (for ANOVA)
    num_groups=2          # Required only for ANOVA (≥3)
):
    """
    Generalized sample size calculator for A/B testing based on experiment type.

    Parameters:
    ----------
    experiment_type : str
        Type of experiment. One of:
        - 'binary'                  : Binary outcome (Z-test on proportions)
        - 'continuous_independent' : Continuous outcome, different users (t-test)
        - 'continuous_paired'      : Continuous outcome, same users (paired t-test)
        - 'anova'                  : Continuous outcome, 3+ groups (ANOVA)

    alpha : float
        Significance level (Type I error), usually 0.05

    power : float
        Desired power of the test, e.g., 0.80 for 80%

    baseline_rate : float
        Current conversion rate (only for binary tests), e.g., 0.10 = 10%

    mde : float
        Minimum detectable effect. Examples:
        - Binary: 0.02 → detect +2% lift
        - Continuous: 5 → detect $5 increase

    std_dev : float
        Standard deviation of the metric (for continuous outcomes)

    effect_size : float
        Optional: if you already know Cohen’s d (for t-tests) or f (for ANOVA)

    num_groups : int
        Only for ANOVA. Must be ≥ 3.

    Returns:
    -------
    int
        Required sample size per group
    """

    # ----- 1. Binary outcome -----
    if experiment_type == 'binary':
        if baseline_rate is None or mde is None:
            raise ValueError("baseline_rate and mde are required for binary tests.")
        
        z_alpha = stats.norm.ppf(1 - alpha / 2)
        z_beta = stats.norm.ppf(power)
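        p1 = baseline_rate
        # Variance is approximated with the baseline rate for both groups
        # (p2 = p1 + mde is not used), which slightly understates n when p2 is closer to 0.5
        pooled_var = 2 * p1 * (1 - p1)

        n = ((z_alpha + z_beta) ** 2 * pooled_var) / (mde ** 2)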
        return int(np.ceil(n))

    # ----- 2. Continuous: Independent groups -----
    elif experiment_type == 'continuous_independent':
        if effect_size is None and (std_dev is None or mde is None):
            raise ValueError("Provide effect_size, or both mde and std_dev, for continuous_independent test.")
        
        if effect_size is None:
            effect_size = mde / std_dev
        
        analysis = TTestIndPower()
        n = analysis.solve_power(effect_size=effect_size, power=power, alpha=alpha)
        return int(np.ceil(n))

    # ----- 3. Continuous: Paired test -----
    elif experiment_type == 'continuous_paired':
        if effect_size is None and (std_dev is None or mde is None):
            raise ValueError("Provide effect_size, or both mde and std_dev, for continuous_paired test.")
        
        if effect_size is None:
            effect_size = mde / std_dev
        
        analysis = TTestPower()
        n = analysis.solve_power(effect_size=effect_size, power=power, alpha=alpha)
        return int(np.ceil(n))

    # ----- 4. ANOVA -----
    elif experiment_type == 'anova':
        if effect_size is None or num_groups < 3:
            raise ValueError("effect_size and num_groups >= 3 are required for ANOVA.")
        
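        analysis = FTestAnovaPower()
        # FTestAnovaPower solves for the total number of observations across all groups,
        # so divide by the number of groups to report a per-group sample size
        n_total = analysis.solve_power(effect_size=effect_size, power=power, alpha=alpha, k_groups=num_groups)
        return int(np.ceil(n_total / num_groups))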

    else:
        raise ValueError(f"Unsupported experiment_type: {experiment_type}")


In [14]:
# 🎯 Experiment assumptions
alpha = 0.05                  # Type I error tolerance
power = 0.80                  # 1 - β (probability of detecting a true effect)
baseline_rate = 0.10          # For binary: current conversion rate = 10%
mde = 0.02                    # Minimum detectable lift, absolute: +2 percentage points (or $5 for continuous)
std_dev = None                # Only for continuous outcomes (e.g., revenue)
effect_size = None            # Optional, used if std_dev is not given
num_groups = 3                # Only needed for ANOVA (≥ 3)

# 🧮 Sample size calculation
required_sample_size = calculate_power_sample_size(
    experiment_type=experiment_type,
    alpha=alpha,
    power=power,
    baseline_rate=baseline_rate,
    mde=mde,
    std_dev=std_dev,
    effect_size=effect_size,
    num_groups=num_groups
)
print("required_sample_size per group:", required_sample_size)

required_sample_size per group: 3532
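
The run above covers the binary case. For a continuous metric, the raw MDE is first converted to a standardized effect size (Cohen's d = MDE / standard deviation) before solving for the sample size. The sketch below uses made-up inputs, a $5 lift on a metric with a $20 standard deviation, purely for illustration.

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

# Illustrative inputs (assumptions): detect a $5 lift when the metric's std dev is $20
mde = 5.0
std_dev = 20.0
cohens_d = mde / std_dev   # standardized effect size = 0.25

# Solve for the control-group size; ratio=1.0 means equal-sized groups
n_required = TTestIndPower().solve_power(
    effect_size=cohens_d, alpha=0.05, power=0.80, ratio=1.0, alternative="two-sided"
)
n_per_group = int(np.ceil(n_required))
print(f"Cohen's d = {cohens_d:.2f}, users per group = {n_per_group}")
```

Halving the MDE halves Cohen's d and roughly quadruples the required sample size, which is why the choice of MDE dominates the cost of a test.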


In [18]:
def print_power_analysis_summary(
    experiment_type,
    alpha,
    power,
    baseline_rate=None,
    mde=None,
    std_dev=None,
    effect_size=None,
    num_groups=2,
    required_sample_size=None
):
    print("📈 Power Analysis Summary")
    print(f"- Significance level (α): {alpha}")
    print(f"- Statistical power (1 - β): {power}")
    
    if experiment_type == 'binary':
        print(f"- Baseline conversion rate: {baseline_rate}")
        print(f"- Minimum detectable effect (MDE): {mde}")
        print(f"- Target conversion rate: {baseline_rate + mde:.2f}")
        print(f"\n✅ Result:")
        print(f"To detect a lift from {baseline_rate:.2f} to {baseline_rate + mde:.2f},")
        print(f"you need {required_sample_size} users in each group (control and treatment).")
        print(f"Total sample size: {required_sample_size * 2} users.")
    
    elif experiment_type == 'continuous_independent':
        if std_dev:
            print(f"- Baseline Std Dev: {std_dev}")
            print(f"- MDE (raw difference to detect): {mde}")
            print(f"- Cohen's d (calculated): {mde / std_dev:.2f}")
        else:
            print(f"- Effect size (Cohen's d): {effect_size}")
        
        print(f"\n✅ Result:")
        print(f"To detect a mean difference of {mde} between two independent groups,")
        print(f"you need {required_sample_size} users per group (total {required_sample_size * 2}).")
    
    elif experiment_type == 'continuous_paired':
        if std_dev:
            print(f"- Std Dev of differences: {std_dev}")
            print(f"- MDE (within-user change): {mde}")
            print(f"- Cohen's d (calculated): {mde / std_dev:.2f}")
        else:
            print(f"- Effect size (Cohen's d): {effect_size}")
        
        print(f"\n✅ Result:")
        print(f"To detect a mean change of {mde} within the same users,")
        print(f"you need data from {required_sample_size} users.")

    elif experiment_type == 'anova':
        print(f"- Number of groups: {num_groups}")
        print(f"- Effect size (Cohen's f): {effect_size}")
        print(f"\n✅ Result:")
        print(f"To detect differences in mean outcome across {num_groups} groups,")
        print(f"you need {required_sample_size} users per group ({required_sample_size * num_groups} total).")

    else:
        print("⚠️ Unknown experiment type — no summary printed.")

In [20]:
print_power_analysis_summary(
    experiment_type=experiment_type,
    alpha=alpha,
    power=power,
    baseline_rate=baseline_rate,
    mde=mde,
    std_dev=std_dev,
    effect_size=effect_size,
    num_groups=num_groups,
    required_sample_size=required_sample_size
)


📈 Power Analysis Summary
- Significance level (α): 0.05
- Statistical power (1 - β): 0.8
- Baseline conversion rate: 0.1
- Minimum detectable effect (MDE): 0.02
- Target conversion rate: 0.12

✅ Result:
To detect a lift from 0.10 to 0.12,
you need 3532 users in each group (control and treatment).
Total sample size: 7064 users.
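
For intuition about where a figure like 3532 comes from, here is a minimal sketch of the normal-approximation sample-size formula for two equal groups. It assumes the standardized effect size is `mde / sqrt(p * (1 - p))` evaluated at the baseline rate. That assumption reproduces the output above, but the binary branch of `calculate_power_sample_size` may compute its effect size differently. The helper name `approx_sample_size_per_group` is hypothetical.

```python
import math
from statistics import NormalDist

def approx_sample_size_per_group(baseline_rate, mde, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided, two-group test.

    Formula: n = 2 * (z_{1 - alpha/2} + z_{power})^2 / d^2
    """
    # Assumption: standardize the absolute lift by the baseline standard deviation
    d = mde / math.sqrt(baseline_rate * (1 - baseline_rate))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = NormalDist().inv_cdf(power)          # quantile matching the desired power
    n = 2 * (z_alpha + z_power) ** 2 / d ** 2
    return math.ceil(n)

n_per_group = approx_sample_size_per_group(baseline_rate=0.10, mde=0.02)
print("approximate sample size per group:", n_per_group)
```

Halving the MDE roughly quadruples the required sample, because `d` enters the formula squared.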


[Back to the top](#Contents)
___

# AB Testing

### 🛠️ Metric Tracked:
- **Primary metric:** Conversion rate (binary: clicked = 1, did not click = 0)
- **Unit of analysis:** Unique user

---

#### 📈 Outcome Analysis Plan:
- Two-sample **z-test for proportions** to compare conversion rates
- Compute confidence intervals and p-values
- Optional: visualizations of effect size and confidence bounds

In [None]:
np.random.seed(my_seed)

# Step 1: Simulate eligible population
n = 7064  # from power analysis
users = pd.DataFrame({
    'user_id': np.arange(1, n + 1),
})

# Step 2: Randomly assign users to control or treatment
users['group'] = np.random.choice(['control', 'treatment'], size=n, replace=True)
users

In [None]:
# Step 3: Simulate conversions
# Assume baseline = 0.10, treatment = 0.12
conversion_rate = {
    'control': 0.10,
    'treatment': 0.12
}
users['converted'] = users['group'].apply(lambda g: np.random.binomial(1, conversion_rate[g]))
users

In [None]:
# Step 4: View summary
summary = users.groupby('group')['converted'].agg(['count', 'sum', 'mean']).rename(columns={
    'count': 'n_users',
    'sum': 'n_converted',
    'mean': 'conversion_rate'
})
summary

In [None]:
# Step 5: Run 2-proportion z-test
control_conv = summary.loc['control', 'conversion_rate']
treatment_conv = summary.loc['treatment', 'conversion_rate']
n_control = summary.loc['control', 'n_users']
n_treatment = summary.loc['treatment', 'n_users']
x_control = summary.loc['control', 'n_converted']
x_treatment = summary.loc['treatment', 'n_converted']

# Pooled conversion rate
p_pooled = (x_control + x_treatment) / (n_control + n_treatment)
se_pooled = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_control + 1/n_treatment))

z_stat = (treatment_conv - control_conv) / se_pooled
# Two-sided p-value; the survival function avoids precision loss for large |z|
p_value = 2 * stats.norm.sf(abs(z_stat))

[Back to the top](#Contents)
___


# Results

#### Conversion Rates

In [None]:
print(f"\n📊 Z-Test Results")
print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value    : {p_value:.4f}")
if p_value < 0.05:
    print("✅ Statistically significant difference detected.")
else:
    print("🚫 No significant difference detected.")

In [None]:
# Already generated in earlier step, but good to reinforce
summary

#### Visualization

In [None]:
# Plot bar chart
fig, ax = plt.subplots()
bars = ax.bar(['Control', 'Treatment'],
              [control_conv, treatment_conv],
              color=['gray', 'skyblue'])

# Add values on top
for bar in bars:
    height = bar.get_height()
    ax.annotate(f'{height:.4f}',
                xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 5),  # vertical offset
                textcoords="offset points",
                ha='center', va='bottom')
ax.set_ylabel('Conversion Rate')
ax.set_title('A/B Test: Conversion Rate by Group')
ax.set_ylim(0, max(control_conv, treatment_conv) + 0.02)
ax.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

#### 95% Confidence Intervals (`outcome of group`)

- A 95% confidence interval gives a range of plausible values for each group's **true conversion rate**, given the observed data.
- If the two intervals **do not overlap**, that is strong evidence that the difference is statistically significant (non-overlap is a conservative check).
- If they **do overlap**, the difference can still be significant. Use the p-value or the confidence interval for the difference to decide, and interpret the lift with caution.
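
To make the overlap caveat concrete, the sketch below uses hypothetical counts (100/1000 vs. 130/1000, not taken from this experiment) where the per-group 95% intervals overlap while the pooled two-proportion z-test still falls below the 5% threshold. The helper `wald_ci` and the `demo_` variable names are illustrative only.

```python
import math
from statistics import NormalDist

def wald_ci(p, n, z=1.96):
    """Normal-approximation (Wald) interval for a single proportion."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# Hypothetical counts chosen for illustration
demo_x_c, demo_n_c = 100, 1000
demo_x_t, demo_n_t = 130, 1000
demo_p_c, demo_p_t = demo_x_c / demo_n_c, demo_x_t / demo_n_t

demo_ci_c = wald_ci(demo_p_c, demo_n_c)
demo_ci_t = wald_ci(demo_p_t, demo_n_t)
intervals_overlap = demo_ci_c[1] >= demo_ci_t[0] and demo_ci_t[1] >= demo_ci_c[0]

# Pooled two-proportion z-test on the same counts
demo_p_pool = (demo_x_c + demo_x_t) / (demo_n_c + demo_n_t)
demo_se_pool = math.sqrt(demo_p_pool * (1 - demo_p_pool) * (1 / demo_n_c + 1 / demo_n_t))
demo_z = (demo_p_t - demo_p_c) / demo_se_pool
demo_p_value = 2 * (1 - NormalDist().cdf(abs(demo_z)))

print(f"Control CI   : [{demo_ci_c[0]:.4f}, {demo_ci_c[1]:.4f}]")
print(f"Treatment CI : [{demo_ci_t[0]:.4f}, {demo_ci_t[1]:.4f}]")
print(f"Intervals overlap: {intervals_overlap}, z-test p-value: {demo_p_value:.4f}")
```

This is why the decision should rest on the test of the difference rather than on eyeballing two separate error bars.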

In [None]:
# Compute confidence intervals
def compute_ci(p, n, z=1.96):
    se = np.sqrt(p * (1 - p) / n)
    return (p - z*se, p + z*se)

ci_control = compute_ci(control_conv, n_control)
ci_treatment = compute_ci(treatment_conv, n_treatment)

# Plot with error bars
plt.errorbar(['Control', 'Treatment'],
             [control_conv, treatment_conv],
             yerr=[[control_conv - ci_control[0], treatment_conv - ci_treatment[0]],
                   [ci_control[1] - control_conv, ci_treatment[1] - treatment_conv]],
             fmt='o', capsize=10, color='black')
plt.ylabel('Conversion Rate')
plt.title('Conversion Rate with 95% Confidence Intervals')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

#### 95% Confidence Intervals (`difference in outcomes`). AKA `Lift Analysis`

In [None]:
# Calculate absolute lift
lift = treatment_conv - control_conv

# Standard error for the difference in proportions
se_diff = np.sqrt(
    (control_conv * (1 - control_conv) / n_control) +
    (treatment_conv * (1 - treatment_conv) / n_treatment)
)

# 95% Confidence Interval for the lift
z = 1.96  # for 95%
ci_lower = lift - z * se_diff
ci_upper = lift + z * se_diff

# Print result
print("📈 Lift Analysis")
print(f"- Absolute Lift: {lift:.4f}")
print(f"- 95% Confidence Interval for Lift: [{ci_lower:.4f}, {ci_upper:.4f}]")

# Interpretation
if ci_lower > 0:
    print("✅ We're 95% confident the new version improved conversion.")
elif ci_upper < 0:
    print("🚫 We're 95% confident the new version hurt conversion.")
else:
    print("🤷 The confidence interval includes 0 — we can't say the lift is statistically significant.")


#### Final Conclusion

In [None]:
print("=" * 40)
print("          📊FINAL A/B TEST SUMMARY")
print("="*40)

print(f"👥  Control conversion rate   :  {control_conv:.4f}")
print(f"🧪  Treatment conversion rate :  {treatment_conv:.4f}")
print(f"📈  Absolute lift             :  {(treatment_conv - control_conv):.4f}")
print(f"📊  Percentage lift           :  {((treatment_conv - control_conv)/control_conv):.2%}")
print(f"🧪  P-value (from z-test)     :  {p_value:.4f}")

print("-" * 40)

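if p_value < 0.05:
    # A two-sided test is significant in either direction, so check the sign of the lift
    if treatment_conv > control_conv:
        print("✅  RESULT: Statistically significant improvement detected.")
    else:
        print("⚠️  RESULT: Statistically significant decline detected.")
else:
    print("❌  RESULT: No statistically significant difference detected.")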

print("="*40 + "\n")

[Back to the top](#Contents)
___


# How Long
##### to run the test?


The runtime of an A/B test depends on how quickly you can reach the required **sample size per group**, as estimated during power analysis.

##### Key Inputs:
- ✅ Daily eligible traffic volume
- ✅ Required sample size (from power analysis)
- ✅ Whether traffic is split 50/50 or unevenly

##### Formula:
> **Days = Required Sample Size per Group / (Daily Eligible Users × Group Split Proportion)**

This ensures the experiment collects **enough users to detect the effect** with the planned statistical power.

##### ⏳ How Long to Run an A/B Test — Real-World Guide

1. **Estimate sample size** needed (done via power analysis)
2. **Understand your traffic** — how many eligible users per day?
3. **Factor in group split** — if it's 50/50, each group gets half the traffic
4. **Divide required sample per group by daily users per group** to estimate days

---

##### 💡 Real-World Recommendations

- ✅ **Ramp-Up Period**: Don’t go full traffic on Day 1. Start with 5%, then 25%, then 50% over 2–3 days. This is safer and helps catch bugs early.
  
- ✅ **Cool-Down Buffer**: Let the test stabilize before analysis. Avoid cutting off during weekends or anomalies.
  
- ✅ **Trust Checks**:
  - Run an **A/A test first** to validate setup
  - Do an **SRM check** to confirm assignment balance
  - Monitor **guardrail metrics** (e.g., bounce rate, latency)

- ⏳ Add **1–2 buffer days** for these. It’s not just about stats — it’s about reliability and business trust.
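
As a rough illustration of how a ramp-up stretches the timeline, the sketch below counts the days until the treatment group reaches its target when its traffic share grows day by day. It assumes the 5% / 25% / 50% figures refer to the treatment group's share of daily eligible traffic and that the last share is held afterwards. The function name `days_with_ramp_up` and the 1,000 users/day figure are hypothetical.

```python
def days_with_ramp_up(required_per_group, daily_eligible_users, treatment_share_by_day):
    """Days until the treatment group reaches its target under a ramp-up schedule.

    treatment_share_by_day: treatment's share of daily traffic on each ramp day;
    the final value is held for all remaining days.
    """
    if not treatment_share_by_day or treatment_share_by_day[-1] <= 0:
        raise ValueError("The final share in the schedule must be positive.")

    collected, day = 0.0, 0
    while collected < required_per_group:
        # Use the scheduled share for this day, or the final share once the ramp is done
        share = treatment_share_by_day[min(day, len(treatment_share_by_day) - 1)]
        collected += daily_eligible_users * share
        day += 1
    return day

# Hypothetical inputs: 3,532 users needed, 1,000 eligible users per day
ramped_days = days_with_ramp_up(3532, 1000, [0.05, 0.25, 0.50])
flat_days = days_with_ramp_up(3532, 1000, [0.50])
print(f"With ramp-up: {ramped_days} days, without ramp-up: {flat_days} days")
```

Under these assumptions the ramp costs about one extra day, which is consistent with budgeting a small buffer on top of the raw estimate.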

---

##### 🧠 Tip
> “We typically recommend calculating sample size from power analysis, then dividing by daily traffic per variant. But we also factor in buffer days for ramp-up, trust checks, and traffic noise — just to make sure we don’t rush analysis before data is stable.”

> “We use power analysis to plan: it gives stakeholders a timeline and sets expectations. But we don’t blindly stop based on sample size. We monitor SRM, metric stability, cohort coverage, and confidence intervals before making the call. We want decisions that are trustworthy, not just statistically complete.”

In [None]:
def estimate_total_test_duration(required_sample_size_per_group, daily_eligible_users, allocation_ratio=0.5):
    """
    Estimate the number of days needed for a group to reach its required sample size.

    Parameters:
    - required_sample_size_per_group: int, sample size needed per group (from power analysis)
    - daily_eligible_users: int, total eligible users arriving each day
    - allocation_ratio: float, proportion of traffic sent to each group (e.g., 0.5 for 50/50);
      for an uneven split, pass the smaller group's share, since that group fills last

    Returns:
    - Estimated number of days (rounded up) until the group reaches the required sample size
    """
    if daily_eligible_users <= 0 or not 0 < allocation_ratio <= 1:
        raise ValueError("daily_eligible_users must be positive and allocation_ratio must be in (0, 1].")

    # Users entering a single group per day
    daily_users_per_group = daily_eligible_users * allocation_ratio
    days_needed = required_sample_size_per_group / daily_users_per_group

    return int(np.ceil(days_needed))

# Example Usage
required_sample = 3532
daily_users = 10000
allocation = 0.5  # 50% of users per group

total_days = estimate_total_test_duration(required_sample, daily_users, allocation)
buffer_days = 2  # For ramp-up, cool-down, or traffic anomalies

print(f"📅 Estimated minimum duration per group : {total_days} days")
print(f"📦 Add buffer (ramp-up, weekends, trust checks): {buffer_days} days")
print(f"🧮 Total recommended runtime             : {total_days + buffer_days} days\n")


#### Monitoring Dashboard
- Guardrails

[Back to the top](#Contents)
___


# Post Hoc Analysis

> 🔍 After statistical significance, we do a post-hoc dive. We check for lift by user segment, inspect guardrail metrics, and simulate rollout impact. This turns A/B tests into real product decisions.

Post-hoc analysis helps interpret test results **beyond the primary metric**, and answer key follow-up questions like:
- Did the effect differ by segment (e.g. platform, user type)?
- Were any guardrail metrics negatively impacted?
- What does the lift mean in terms of actual revenue or conversions?
- What happens if we roll this out 100%?

This step helps turn **statistical significance into business confidence.**

#### Segmented Lift

In [None]:
# Let's add a few realistic segmentation features to the users DataFrame
# (e.g., platform, device_type, user_tier, region)

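# Use a different seed than the group assignment: re-seeding with my_seed would replay
# the same random draws and make 'platform' line up exactly with 'group'
np.random.seed(my_seed + 1)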

users['platform'] = np.random.choice(['iOS', 'Android'], size=len(users))
users['device_type'] = np.random.choice(['mobile', 'desktop'], size=len(users), p=[0.7, 0.3])
users['user_tier'] = np.random.choice(['new', 'returning'], size=len(users), p=[0.4, 0.6])
users['region'] = np.random.choice(['North', 'South', 'East', 'West'], size=len(users))

# Return a preview of the updated DataFrame
users.head()

In [None]:
def segmented_lift_analysis(df, segment_cols, outcome_col='converted'):
    for segment in segment_cols:
        print(f"\n📊 Segment: {segment}")
        print("-" * 40)

        # 1. Aggregate conversion metrics
        grouped = df.groupby([segment, 'group'])[outcome_col].agg(['count', 'sum', 'mean']).reset_index()
        grouped.columns = [segment, 'group', 'n_users', 'n_converted', 'conversion_rate']

        # 2. Pivot to calculate absolute and percentage lift
        pivoted = grouped.pivot(index=segment, columns='group', values='conversion_rate').reset_index()
        pivoted['absolute_lift'] = pivoted['treatment'] - pivoted['control']
        pivoted['percentage_lift'] = pivoted['absolute_lift'] / pivoted['control']
        print(pivoted[[segment, 'control', 'treatment', 'absolute_lift', 'percentage_lift']].to_string(index=False))

        # 3. Plot
        plt.figure(figsize=(10, 4))
        ax = sns.barplot(data=grouped, x=segment, y='conversion_rate', hue='group')

        # 4. Add value labels
        for bar in ax.patches:
            height = bar.get_height()
            ax.annotate(f"{height:.3f}", 
                        (bar.get_x() + bar.get_width() / 2, height + 0.005),
                        ha='center', va='bottom', fontsize=9)

        # 5. Style tweaks — fixed layout
        ax.set_title(f'Conversion Rate by {segment} and Group', pad=15)
        ax.set_ylabel('Conversion Rate')
        ax.set_xlabel(segment)
        ax.grid(axis='y', linestyle='--', alpha=0.6)
        ax.set_ylim(0, grouped['conversion_rate'].max() + 0.05)

        # 👇 FIXED LEGEND POSITION
        ax.legend(
            # title='Group', 
            loc='upper right', 
            fontsize='small', 
            title_fontsize='small',
            frameon=True,
            borderpad=0.5)

        plt.tight_layout()
        plt.show()

In [None]:
segmented_columns = ['platform', 'device_type', 'user_tier', 'region']
segmented_lift_analysis(users, segmented_columns)

#### Guardrail Metric Checks

Guardrail metrics are **non-primary metrics** that help ensure an experiment isn't causing unintended harm.

We monitor them alongside the main success metric to:
- Detect regressions in user experience or system performance
- Catch trade-offs early (e.g., improved conversions but worse bounce rate)

---

📊 Common Guardrail Metrics
- Bounce Rate — did the experience frustrate users?
- Page Load Time / Latency — did the feature slow down the UI?
- Session Length / Engagement — did users stick around?
- Error Rate — did the experiment introduce bugs?

---

✅ When to Act:
- If the **treatment worsens a guardrail metric**, even if the primary metric improved → we pause, investigate, or reject the rollout.

Guardrails aren’t always deal-breakers, but they’re **trust checks** that make A/B test results *safe* to act on.

In [None]:
# Add a simulated guardrail metric: bounce_rate (continuous, clipped to [0, 1] below)
# Bounce rate is typically higher when users disengage quickly

np.random.seed(my_seed)
users['bounce_rate'] = np.where(
    users['converted'] == 1,
    np.random.normal(loc=0.2, scale=0.05, size=len(users)),  # lower bounce if converted
    np.random.normal(loc=0.6, scale=0.1, size=len(users))   # higher bounce if not
)

# Clip to stay between 0 and 1
users['bounce_rate'] = users['bounce_rate'].clip(0, 1)

# Return sample for inspection
users.head()

In [None]:
# Example if you had bounce_rate as a column:
guardrail = users.groupby('group')['bounce_rate'].mean()
print("🚦 Average Bounce Rate by Group:")
print(guardrail)

In [None]:
# 1. Group means
bounce_by_group = users.groupby('group')['bounce_rate'].mean()
bounce_diff = bounce_by_group['treatment'] - bounce_by_group['control']

# 2. Optional statistical test (t-test)
control_bounce = users[users['group'] == 'control']['bounce_rate']
treatment_bounce = users[users['group'] == 'treatment']['bounce_rate']
t_stat, p_val = ttest_ind(treatment_bounce, control_bounce)

# 3. Print summary and conclusion
print("🚦 Guardrail Check")
print(f"- Control  : {bounce_by_group['control']:.4f}")
print(f"- Treatment: {bounce_by_group['treatment']:.4f}")
print(f"- Difference: {bounce_diff:+.4f}")
print(f"- P-value (t-test): {p_val:.4f}")

# 4. Conclusion
if p_val < 0.05:
    if bounce_diff > 0:
        print("❌ Statistically significant *increase* in bounce rate — potential UX regression.")
    else:
        print("✅ Statistically significant *decrease* in bounce rate — UX may have improved.")
else:
    print("🟡 No significant change in bounce rate — guardrail looks stable.")

#### Rollout Simulation

In [None]:
# Assume full exposure of all daily eligible users (daily_users from the duration estimate)
daily_impact = (treatment_conv - control_conv) * daily_users
monthly_impact = daily_impact * 30

print(f"📈 Estimated additional conversions per day: {daily_impact:.0f}")
print(f"📈 Estimated additional conversions per month: {monthly_impact:.0f}")
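
The point estimate above hides the uncertainty in the lift. As a sketch, the same projection can be reported as a range by scaling the confidence interval for the difference in proportions. The counts, the traffic figure, and the function name `rollout_impact_range` are hypothetical, and the projection assumes the lift observed in the test holds at full rollout.

```python
import math
from statistics import NormalDist

def rollout_impact_range(x_c, n_c, x_t, n_t, daily_users, days=30, confidence=0.95):
    """Project extra conversions over `days`, with a CI scaled from the lift CI."""
    p_c, p_t = x_c / n_c, x_t / n_t
    lift = p_t - p_c
    # Unpooled standard error for the difference in proportions
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    exposed_users = daily_users * days
    return {
        "low": (lift - z * se) * exposed_users,
        "point": lift * exposed_users,
        "high": (lift + z * se) * exposed_users,
    }

# Hypothetical counts and traffic, for illustration only
impact = rollout_impact_range(x_c=353, n_c=3532, x_t=424, n_t=3532, daily_users=10_000)
print(f"Extra conversions over 30 days: {impact['point']:.0f} "
      f"(95% CI: {impact['low']:.0f} to {impact['high']:.0f})")
```

A wide range here is a signal to present the rollout estimate as a band rather than a single number.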

[Back to the top](#Contents)
___


# Scratch Notes


In [None]:
# Simulate data
n_control = 5000  # Sample size for control group
n_treatment = 5000  # Sample size for treatment group

# Assume conversion rate is 10% for control and 12% for treatment
control_conversions = np.random.binomial(1, 0.10, n_control)
treatment_conversions = np.random.binomial(1, 0.12, n_treatment)

# Create DataFrame
df = pd.DataFrame({
    'group': ['control'] * n_control + ['treatment'] * n_treatment,
    'conversion': np.concatenate([control_conversions, treatment_conversions])
})
df.head()
df.shape

# Display summary
df.groupby('group')['conversion'].agg(['mean', 'count', 'sum'])


#### AB Testing

1. Chi-Square Test (Categorical Data - Conversion Rates)     
Used when testing differences in proportions (e.g., conversion rate increase).

In [None]:
# Create contingency table
conversion_table = pd.crosstab(df['group'], df['conversion'])

# Chi-Square Test
chi2_stat, p_value, dof, expected = chi2_contingency(conversion_table)

# Print results
print(f"Chi-Square Statistic: {chi2_stat:.4f}")
print(f"P-Value: {p_value:.4f}")

# Check significance
alpha = 0.05  # 95% confidence level
if p_value < alpha:
    print("Reject the null hypothesis: Significant difference detected.")
else:
    print("Fail to reject the null hypothesis: No significant difference detected.")


2. Independent T-Test (Continuous Data - Avg. Time Spent, Revenue)    
Used when comparing means of continuous data between two independent groups.

In [None]:
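# Perform independent t-test
# Note: the 0/1 conversion arrays are reused here only to illustrate the call;
# in practice, pass a continuous metric such as revenue or time spent
t_stat, p_val = ttest_ind(control_conversions, treatment_conversions)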

print(f"T-Statistic: {t_stat:.4f}")
print(f"P-Value: {p_val:.4f}")

if p_val < alpha:
    print("Reject the null hypothesis: The new feature has a significant effect.")
else:
    print("Fail to reject the null hypothesis: No significant difference detected.")


3. Paired T-Test (Before/After Tests - Same Users)    
Used when measuring differences within the same group (e.g., before vs. after).

In [None]:
# Example: User engagement before & after a UI change
before = np.random.normal(200, 25, 100)  # Mean 200 sec, std dev 25
after = np.random.normal(210, 25, 100)  # Mean 210 sec, std dev 25

t_stat, p_val = ttest_rel(before, after)

print(f"Paired T-Test Statistic: {t_stat:.4f}")
print(f"P-Value: {p_val:.4f}")

if p_val < 0.05:
    print("Significant difference detected.")
else:
    print("No significant difference detected.")

4. Mann-Whitney U Test (Non-Normal Continuous Data)
A non-parametric test used when data isn’t normally distributed (e.g., skewed revenue).

In [None]:
# Example: Revenue data (skewed distribution)
control_revenue = np.random.exponential(50, 5000)  # Skewed
treatment_revenue = np.random.exponential(55, 5000)  # Skewed

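# State the alternative explicitly; the default has differed across SciPy versions
u_stat, p_val = mannwhitneyu(control_revenue, treatment_revenue, alternative='two-sided')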

print(f"Mann-Whitney U Statistic: {u_stat:.4f}")
print(f"P-Value: {p_val:.4f}")

if p_val < 0.05:
    print("Significant difference detected.")
else:
    print("No significant difference detected.")


5. Bayesian A/B Testing (Alternative Approach)    
Instead of p-values, Bayesian A/B testing provides posterior probability distributions.

In [None]:
# import pymc3 as pm

# # Simulated conversion data
# control_conversions = np.sum(np.random.binomial(1, 0.10, 5000))
# treatment_conversions = np.sum(np.random.binomial(1, 0.12, 5000))

# with pm.Model():
#     # Priors: uniform Beta(1, 1) on each conversion rate
#     control_rate = pm.Beta("control_rate", alpha=1, beta=1)
#     treatment_rate = pm.Beta("treatment_rate", alpha=1, beta=1)

#     # Likelihoods: observed conversion counts
#     control = pm.Binomial("control", n=5000, p=control_rate, observed=control_conversions)
#     treatment = pm.Binomial("treatment", n=5000, p=treatment_rate, observed=treatment_conversions)

#     trace = pm.sample(2000, return_inferencedata=True)

# # Compute probability that the treatment is better
# prob_treatment_better = (trace.posterior["treatment_rate"] > trace.posterior["control_rate"]).mean().item()

# print(f"Probability that the treatment is better: {prob_treatment_better:.4f}")
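
For a Beta prior with a binomial likelihood, the posterior is available in closed form, so the same question can be answered without MCMC. The sketch below is a swapped-in conjugate Beta-Binomial approach, not the output of the commented-out model; the counts (500/5000 and 600/5000) and the seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the draws are reproducible

# Hypothetical observed data
bayes_n_c, bayes_x_c = 5000, 500   # control: users, conversions
bayes_n_t, bayes_x_t = 5000, 600   # treatment: users, conversions

# Beta(1, 1) prior + binomial likelihood -> Beta(1 + successes, 1 + failures) posterior
post_control = rng.beta(1 + bayes_x_c, 1 + bayes_n_c - bayes_x_c, size=100_000)
post_treatment = rng.beta(1 + bayes_x_t, 1 + bayes_n_t - bayes_x_t, size=100_000)

# Share of posterior draws in which the treatment rate exceeds the control rate
prob_treatment_better_mc = float(np.mean(post_treatment > post_control))
# 95% credible interval for the absolute lift
lift_low, lift_high = np.percentile(post_treatment - post_control, [2.5, 97.5])

print(f"P(treatment > control): {prob_treatment_better_mc:.4f}")
print(f"95% credible interval for lift: [{lift_low:.4f}, {lift_high:.4f}]")
```

The credible interval for the lift plays the same role as the frequentist confidence interval for the difference, but reads directly as a probability statement about the rates.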


6. Power Analysis (How Much Data Do You Need?)
Used to determine sample size before running an experiment.

In [None]:
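# Define parameters
baseline_rate = 0.10  # Current conversion rate
mde = 0.02  # Expected absolute lift in conversion rate
# solve_power expects a standardized effect size (Cohen's d), not the raw lift
effect_size = mde / np.sqrt(baseline_rate * (1 - baseline_rate))
alpha = 0.05  # Significance level
power = 0.8  # Desired statistical power

# Compute sample size per group
analysis = TTestIndPower()
sample_size = analysis.solve_power(effect_size=effect_size, power=power, alpha=alpha, ratio=1)

# Round up: a fractional user still requires one more observation
print(f"Required Sample Size per Group: {int(np.ceil(sample_size))}")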

# Other Notes

### Experimentation Infrastructure

[Back to the top](#Contents)
___