# Contents
- [Setup](#Setup)
- [Randomization Methods](#Randomization-Methods)
- [AA Testing](#AA-Testing)
- [Power Analysis](#Power-Analysis)
- [AB Testing](#AB-Testing)
- [Results](#Results)
    - [Visualization](#Visualization)
- [How Long](#How-Long)
- [Post Hoc Analysis](#Post-Hoc-Analysis)
- [Other Notes](#Other-Notes)
- [Scratch Notes](#Scratch-Notes)

# Setup

In [26]:
import numpy as np
import pandas as pd

# Set random seed for reproducibility
np.random.seed(42)

# Display all outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

[Back to the top](#Contents)
___


# Randomization Methods

Random assignment of users to groups:
- Ensures that differences in outcome metrics are due to the experiment and not pre-existing differences between users.
- Eliminates selection bias (e.g., users choosing their own group).
- Helps balance confounding variables (e.g., demographics, device type, purchase history).
- Enables valid statistical inference.

### Simple Randomization
- Each user has an equal chance (e.g., 50/50 split for A/B) of being assigned to treatment or control.
- Works well when sample sizes are large.

In [27]:
import numpy as np
import pandas as pd

# Sample user data
np.random.seed(42)  # For reproducibility
users = pd.DataFrame({'user_id': range(1, 101)})  # 100 users

# Assign each user to 'control' or 'treatment' with 50% probability
users['group'] = np.random.choice(['control', 'treatment'], size=len(users), p=[0.5, 0.5])

# Display the result
users

Unnamed: 0,user_id,group
0,1,control
1,2,treatment
2,3,treatment
3,4,treatment
4,5,control
...,...,...
95,96,control
96,97,treatment
97,98,control
98,99,control
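
A quick simulation can illustrate why simple randomization works well when sample sizes are large. The sketch below repeats a 50/50 coin-flip assignment many times and measures how far the treatment share strays from 50%. The helper `treatment_share`, the sample sizes, and the number of repetitions are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator; the seed value is arbitrary

def treatment_share(n_users, rng):
    """Share of users landing in treatment under a 50/50 coin flip."""
    groups = rng.choice(['control', 'treatment'], size=n_users, p=[0.5, 0.5])
    return np.mean(groups == 'treatment')

# Repeat the assignment 200 times per sample size and record the
# average absolute deviation of the treatment share from 50%
deviation = {}
for n_users in (100, 10_000):
    shares = np.array([treatment_share(n_users, rng) for _ in range(200)])
    deviation[n_users] = np.abs(shares - 0.5).mean()
    print(f"n={n_users}: average deviation from 50% = {deviation[n_users]:.4f}")
```

The deviation shrinks roughly with the square root of the sample size, which is why small experiments benefit from the blocked or stratified schemes below.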


### Stratified Sampling
- Ensures balance across key segments (e.g., country, platform, user tenure).
- Example: If 60% of your users are on iOS and 40% on Android, simple randomization might cause an imbalance, so you stratify by platform.

In [28]:
from sklearn.model_selection import train_test_split

# Create a sample dataset with 100 users and platform labels
np.random.seed(42)
users = pd.DataFrame({
    'user_id': range(1, 101),
    'platform': np.random.choice(['iOS', 'Android'], size=100, p=[0.6, 0.4])  # 60% iOS, 40% Android
})

# Stratify by platform to ensure balance
train, test = train_test_split(users, test_size=0.5, stratify=users['platform'], random_state=42)

# Assign groups (assign() returns new DataFrames and avoids SettingWithCopyWarning)
train = train.assign(group='control')
test = test.assign(group='treatment')

# Merge and display
users = pd.concat([train, test]).sort_values('user_id')
users

Unnamed: 0,user_id,platform,group
0,1,iOS,control
1,2,Android,control
2,3,Android,treatment
3,4,iOS,treatment
4,5,iOS,control
...,...,...,...
95,96,iOS,treatment
96,97,iOS,control
97,98,iOS,treatment
98,99,iOS,control
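
As an alternative to borrowing `train_test_split`, stratified assignment can be sketched directly in pandas: shuffle a balanced set of labels inside each platform. The helper `assign_within_stratum` is a hypothetical name introduced for illustration, and the simulated data only mirrors the example above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
users = pd.DataFrame({
    'user_id': range(1, 101),
    'platform': rng.choice(['iOS', 'Android'], size=100, p=[0.6, 0.4]),
})

def assign_within_stratum(stratum_size, rng):
    """Return shuffled labels that are balanced to within one user."""
    labels = np.array(['control', 'treatment'])[np.arange(stratum_size) % 2]
    return rng.permutation(labels)

# Assign groups separately inside each platform
users['group'] = 'control'
for platform, idx in users.groupby('platform').groups.items():
    users.loc[idx, 'group'] = assign_within_stratum(len(idx), rng)

# Balance check: each platform should be split (almost) evenly
balance = pd.crosstab(users['platform'], users['group'])
print(balance)
```

If a stratum has an odd number of users, this sketch gives the extra user to control; a production setup would want to randomize that tie as well.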


### Blocked Randomization
- Ensures equal group sizes in small experiments.
- Example: If testing on 100 users, you create blocks of 10 and assign 5 to control and 5 to treatment within each block.


In [29]:
# Define block size (e.g., groups of 10 users)
block_size = 10
users = pd.DataFrame({'user_id': range(1, 101)})

# Assign blocks
users['block'] = (users['user_id'] - 1) // block_size

# Within each block, shuffle a balanced set of labels so exactly half go to each group
# (assumes an even block size)
users['group'] = users.groupby('block')['user_id'].transform(lambda x: np.random.permutation(np.repeat(['control', 'treatment'], len(x) // 2)))

# Drop the block column after assignment
users = users.drop(columns=['block'])
users

Unnamed: 0,user_id,group
0,1,control
1,2,treatment
2,3,control
3,4,control
4,5,treatment
...,...,...
95,96,treatment
96,97,treatment
97,98,control
98,99,control


### Matched-Pair Randomization

Participants are paired based on similar characteristics before being randomly assigned to different groups. This ensures that treatment and control groups are balanced on key covariates, reducing variance and improving statistical power.

When to Use Matched-Pair Randomization?
- When you have a small sample size and need to control for confounders.
- When key characteristics (e.g., age, income, purchase history) could influence the outcome.
- When you want to minimize variance by ensuring similar individuals are in each group.

How It Works:
- Identify key variables that might impact the outcome (e.g., age, income, engagement level).
- Sort users based on these variables.
- Create pairs (or small groups) of users with similar characteristics.
- Within each pair, randomly assign one user to treatment and the other to control.


In [30]:
import numpy as np
import pandas as pd

# Generate sample user data
np.random.seed(42)
users = pd.DataFrame({
    'user_id': range(1, 101),
    'engagement_score': np.random.normal(50, 15, 100)  # Simulated user engagement scores
})

# Sort users by engagement to create pairs
users = users.sort_values(by='engagement_score').reset_index(drop=True)

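# Assign treatment/control within pairs by alternating rows (even index -> control).
# Note: this alternation is deterministic; a true matched-pair design flips a coin within each pair.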
users['group'] = np.where(users.index % 2 == 0, 'control', 'treatment')
users


Unnamed: 0,user_id,engagement_score,group
0,75,10.703823,control
1,80,20.186466,treatment
2,38,20.604948,control
3,14,21.300796,treatment
4,50,23.554398,control
...,...,...,...
95,4,72.845448,treatment
96,72,73.070548,control
97,74,73.469655,treatment
98,7,73.688192,control
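
The last step in "How It Works" calls for a random assignment within each pair. A minimal sketch of that step is below, assuming pairs are formed from adjacent rows after sorting by engagement. The `pair_id` column and the one-coin-flip-per-pair scheme are illustrative choices, not the notebook's original method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
users = pd.DataFrame({
    'user_id': range(1, 101),
    'engagement_score': rng.normal(50, 15, 100),  # simulated engagement scores
})

# Sort so that neighbours have similar engagement, then pair rows (0,1), (2,3), ...
users = users.sort_values('engagement_score').reset_index(drop=True)
pair_id = np.arange(len(users)) // 2
users['pair_id'] = pair_id

# One coin flip per pair decides whether the first member gets treatment;
# the second member always receives the opposite group
first_is_treated = rng.integers(0, 2, size=len(users) // 2).astype(bool)
is_first = np.arange(len(users)) % 2 == 0
treated = np.where(is_first, first_is_treated[pair_id], ~first_is_treated[pair_id])
users['group'] = np.where(treated, 'treatment', 'control')

print(users.head(6))
```

Every pair ends up with exactly one treated and one control user, but which member is treated no longer depends on who has the slightly lower engagement score.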


### Cluster Randomization
- Assigns whole groups (e.g., entire cities or schools) instead of individuals.
- Useful when spillover effects are a concern (e.g., referral programs).

In [31]:
# Sample dataset with users belonging to different cities
np.random.seed(42)
users = pd.DataFrame({
    'user_id': range(1, 101),
    'city': np.random.choice(['New York', 'San Francisco', 'Chicago', 'Austin'], size=100)
})

# Randomly assign cities to control or treatment
unique_cities = users['city'].unique()
city_assignments = dict(zip(unique_cities, np.random.choice(['control', 'treatment'], size=len(unique_cities), replace=True)))

# Assign users based on their city
users['group'] = users['city'].map(city_assignments)
users

Unnamed: 0,user_id,city,group
0,1,Chicago,control
1,2,Austin,treatment
2,3,New York,treatment
3,4,Chicago,control
4,5,Chicago,control
...,...,...,...
95,96,San Francisco,treatment
96,97,San Francisco,treatment
97,98,Austin,treatment
98,99,San Francisco,treatment
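
With only a handful of clusters, independent coin flips per city can produce a lopsided split (or even put every city in the same group). One way to avoid that, sketched here under the assumption that an even split of clusters is wanted, is to shuffle the clusters and cut the list in half.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
cities = ['New York', 'San Francisco', 'Chicago', 'Austin']
users = pd.DataFrame({
    'user_id': range(1, 101),
    'city': rng.choice(cities, size=100),
})

# Shuffle the clusters, then give the first half to control and the rest to treatment
shuffled = rng.permutation(users['city'].unique())
half = len(shuffled) // 2
city_assignments = {
    city: ('control' if i < half else 'treatment')
    for i, city in enumerate(shuffled)
}

# Every user inherits the group of their city
users['group'] = users['city'].map(city_assignments)
print(city_assignments)
```

Group sizes in users can still differ because cities differ in size; the balance here is in the number of clusters, which is the unit of randomization.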


### CUPED
- Uses historical data to reduce variance and increase test sensitivity.
- Doesn’t change how users are assigned but helps in post-analysis.

In [32]:
import statsmodels.api as sm

# Simulated pre-experiment metric (e.g., past purchase count)
np.random.seed(42)
users = pd.DataFrame({'user_id': range(1, 101)})
users['pre_experiment_metric'] = np.random.normal(loc=50, scale=10, size=100)  # Baseline metric

# Simple Randomization
users['group'] = np.random.choice(['control', 'treatment'], size=len(users), p=[0.5, 0.5])

# Compute CUPED adjustment
X = users[['pre_experiment_metric']]
X = sm.add_constant(X)  # Add intercept
y = np.random.normal(loc=0, scale=5, size=100)  # Simulated experiment outcome

# Regression to estimate theta (correction factor)
theta = sm.OLS(y, X).fit().params['pre_experiment_metric']

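# Adjust the outcome using pre-experiment data.
# Note: CUPED is usually written as y - theta * (X - mean(X)); the uncentered form
# below differs only by a constant, so differences between groups are unaffected.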
users['adjusted_outcome'] = y - theta * users['pre_experiment_metric']
users


Unnamed: 0,user_id,pre_experiment_metric,group,adjusted_outcome
0,1,54.967142,control,0.719184
1,2,48.617357,control,7.846275
2,3,56.476885,control,-0.651142
3,4,65.230299,control,14.377164
4,5,47.658466,treatment,3.695529
...,...,...,...,...
95,96,35.364851,control,-6.758427
96,97,52.961203,treatment,6.446120
97,98,52.610553,control,0.677294
98,99,50.051135,treatment,-4.311875
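
The variance reduction only shows up when the outcome is correlated with the pre-experiment metric, which the simulated `y` above is not. The sketch below uses hypothetical data where the outcome depends on the pre-metric plus an assumed true effect of 1.0, and applies the centered CUPED adjustment with theta = Cov(Y, X) / Var(X). All coefficients are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical data: the outcome is correlated with the pre-experiment metric
pre = rng.normal(50, 10, n)
treated = rng.integers(0, 2, n).astype(bool)
outcome = 0.8 * pre + rng.normal(0, 5, n) + 1.0 * treated  # assumed true effect = 1.0

# theta = Cov(Y, X) / Var(X), i.e. the OLS slope of the outcome on the pre-metric
theta = np.cov(outcome, pre)[0, 1] / np.var(pre, ddof=1)

# Centering the covariate keeps the overall mean of the outcome unchanged
adjusted = outcome - theta * (pre - pre.mean())

def diff_in_means(y):
    """Treatment minus control mean for a given outcome vector."""
    return y[treated].mean() - y[~treated].mean()

print(f"Variance before: {outcome.var():.2f}, after: {adjusted.var():.2f}")
print(f"Effect before: {diff_in_means(outcome):.3f}, after: {diff_in_means(adjusted):.3f}")
```

Both estimates target the same effect, but the adjusted outcome has much lower variance, so the same experiment yields a tighter estimate.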


[Back to the top](#Contents)
___


# AA Testing

A/A testing is an experiment where both groups (control and treatment) receive the **same experience** to ensure that randomization is working correctly. It acts as a **sanity check** before running an A/B test.

### Importance
- **Validates Randomization:** Ensures that groups are statistically similar before testing.
- **Detects Sample Ratio Mismatch (SRM):** Verifies that user assignment is balanced.
- **Estimates Variance for Power Analysis:** Helps understand variability before defining sample size for A/B testing.
- **Checks for Pre-Existing Bias:** Ensures no systematic differences exist between control and treatment groups.

### Process
1. **Randomly assign users** to two equal groups (just like an A/B test).
2. **Measure key metrics** (e.g., conversion rate, engagement, revenue).
3. **Perform statistical tests** (e.g., t-test for continuous data, chi-square for categorical data) to confirm no significant difference.
4. **Analyze Sample Ratio Mismatch (SRM)** to verify even split between groups.

### Result
- **No significant difference:** Randomization is working correctly, and the experiment is set up properly.
- **Significant difference detected:** Investigate potential issues such as randomization bugs, sample bias, or instrumentation errors.
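
The process above can be sketched on simulated data. The metric (seconds on page), its distribution, and the sample size are assumptions chosen for illustration. Steps 3 and 4 use Welch's t-test and a chi-square goodness-of-fit test from SciPy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_users = 10_000

# Steps 1-2: random 50/50 assignment; both groups get the same experience,
# so the metric is drawn from the same distribution for everyone
group = rng.choice(['control', 'treatment'], size=n_users)
metric = rng.normal(loc=200, scale=25, size=n_users)  # e.g., seconds on page

control = metric[group == 'control']
treatment = metric[group == 'treatment']

# Step 3: compare the metric between groups (Welch's t-test)
t_stat, p_metric = stats.ttest_ind(control, treatment, equal_var=False)

# Step 4: SRM check, comparing observed group sizes with the intended 50/50 split
observed = [len(control), len(treatment)]
expected = [n_users / 2, n_users / 2]
chi2_stat, p_srm = stats.chisquare(observed, f_exp=expected)

print(f"Metric difference p-value: {p_metric:.4f}")
print(f"SRM p-value: {p_srm:.4f}")
```

Even with correct randomization, about 5% of A/A tests will show p < 0.05 by chance, so a single significant result is a prompt to investigate rather than proof of a bug.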


[Back to the top](#Contents)
___


# Power Analysis
- Sample size and power analysis



[Back to the top](#Contents)
___

# AB Testing
- Experiment Implementation

[Back to the top](#Contents)
___


# Results 
- Analyze results and check for statistical significance, for example with lift charts or confidence intervals.

## Visualization

[Back to the top](#Contents)
___


# How Long

[Back to the top](#Contents)
___


# Post Hoc Analysis
- Additional tests run after the primary analysis to explore further patterns.

[Back to the top](#Contents)
___


# Other Notes

[Back to the top](#Contents)
___


# Scratch Notes


In [33]:
# Simulate data
n_control = 5000  # Sample size for control group
n_treatment = 5000  # Sample size for treatment group

# Assume conversion rate is 10% for control and 12% for treatment
control_conversions = np.random.binomial(1, 0.10, n_control)
treatment_conversions = np.random.binomial(1, 0.12, n_treatment)

# Create DataFrame
df = pd.DataFrame({
    'group': ['control'] * n_control + ['treatment'] * n_treatment,
    'conversion': np.concatenate([control_conversions, treatment_conversions])
})
df.head()
df.shape

# Display summary
df.groupby('group')['conversion'].agg(['mean', 'count', 'sum'])


Unnamed: 0,group,conversion
0,control,0
1,control,0
2,control,0
3,control,0
4,control,1


(10000, 2)

group,mean,count,sum
control,0.0976,5000,488
treatment,0.1128,5000,564


#### AB Testing

1. Chi-Square Test (Categorical Data - Conversion Rates)     
Used when testing differences in proportions (e.g., conversion rate increase).

In [34]:
from scipy.stats import chi2_contingency

# Create contingency table
conversion_table = pd.crosstab(df['group'], df['conversion'])

# Chi-Square Test
chi2_stat, p_value, dof, expected = chi2_contingency(conversion_table)

# Print results
print(f"Chi-Square Statistic: {chi2_stat:.4f}")
print(f"P-Value: {p_value:.4f}")

# Check significance
alpha = 0.05  # Significance level (5%), i.e., 95% confidence
if p_value < alpha:
    print("Reject the null hypothesis: Significant difference detected.")
else:
    print("Fail to reject the null hypothesis: No significant difference detected.")


Chi-Square Statistic: 5.9756
P-Value: 0.0145
Reject the null hypothesis: Significant difference detected.


2. Independent T-Test (Continuous Data - Avg. Time Spent, Revenue)    
Used when comparing means of continuous data between two independent groups.

In [35]:
from scipy.stats import ttest_ind

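# Perform independent t-test
# (applied here to the binary conversion arrays simulated above; with samples
# this large it closely approximates a two-proportion z-test)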
t_stat, p_val = ttest_ind(control_conversions, treatment_conversions)

print(f"T-Statistic: {t_stat:.4f}")
print(f"P-Value: {p_val:.4f}")

if p_val < alpha:
    print("Reject the null hypothesis: The new feature has a significant effect.")
else:
    print("Fail to reject the null hypothesis: No significant difference detected.")


T-Statistic: -2.4776
P-Value: 0.0132
Reject the null hypothesis: The new feature has a significant effect.


3. Paired T-Test (Before/After Tests - Same Users)    
Used when measuring differences within the same group (e.g., before vs. after).

In [36]:
from scipy.stats import ttest_rel

# Example: User engagement before & after a UI change
before = np.random.normal(200, 25, 100)  # Mean 200 sec, std dev 25
after = np.random.normal(210, 25, 100)  # Mean 210 sec, std dev 25

t_stat, p_val = ttest_rel(before, after)

print(f"Paired T-Test Statistic: {t_stat:.4f}")
print(f"P-Value: {p_val:.4f}")

if p_val < 0.05:
    print("Significant difference detected.")
else:
    print("No significant difference detected.")


Paired T-Test Statistic: -3.5348
P-Value: 0.0006
Significant difference detected.


4. Mann-Whitney U Test (Non-Normal Continuous Data)
A non-parametric test used when data isn’t normally distributed (e.g., skewed revenue).

In [37]:
from scipy.stats import mannwhitneyu

# Example: Revenue data (skewed distribution)
control_revenue = np.random.exponential(50, 5000)  # Skewed
treatment_revenue = np.random.exponential(55, 5000)  # Skewed

u_stat, p_val = mannwhitneyu(control_revenue, treatment_revenue)

print(f"Mann-Whitney U Statistic: {u_stat:.4f}")
print(f"P-Value: {p_val:.4f}")

if p_val < 0.05:
    print("Significant difference detected.")
else:
    print("No significant difference detected.")


Mann-Whitney U Statistic: 11697821.0000
P-Value: 0.0000
Significant difference detected.


5. Bayesian A/B Testing (Alternative Approach)  
Instead of p-values, Bayesian A/B testing provides posterior probability distributions.

In [38]:
# import pymc3 as pm

# # Simulated conversion data
# control_conversions = np.sum(np.random.binomial(1, 0.10, 5000))
# treatment_conversions = np.sum(np.random.binomial(1, 0.12, 5000))

# with pm.Model():
#     # Priors: uniform Beta(1, 1) on each conversion rate
#     control_rate = pm.Beta("control_rate", alpha=1, beta=1)
#     treatment_rate = pm.Beta("treatment_rate", alpha=1, beta=1)

#     # Likelihoods: observed conversion counts
#     control = pm.Binomial("control", n=5000, p=control_rate, observed=control_conversions)
#     treatment = pm.Binomial("treatment", n=5000, p=treatment_rate, observed=treatment_conversions)

#     trace = pm.sample(2000, return_inferencedata=True)

# # Compute probability that the treatment is better
# prob_treatment_better = (trace.posterior["treatment_rate"] > trace.posterior["control_rate"]).mean().item()

# print(f"Probability that the treatment is better: {prob_treatment_better:.4f}")
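
For conversion data, the same posterior can be obtained without an MCMC library, because a Beta prior with binomial data gives a Beta posterior in closed form. The sketch below swaps in that conjugate shortcut for the commented-out model and reuses the conversion counts from the simulated summary earlier in these notes (488 and 564 out of 5000). The number of posterior draws is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)

# Conversion counts taken from the simulated summary table above
n_control, conv_control = 5000, 488
n_treatment, conv_treatment = 5000, 564

# With a Beta(1, 1) prior and binomial data, the posterior is
# Beta(1 + conversions, 1 + non-conversions)
post_control = rng.beta(1 + conv_control, 1 + n_control - conv_control, size=100_000)
post_treatment = rng.beta(1 + conv_treatment, 1 + n_treatment - conv_treatment, size=100_000)

# Share of posterior draws in which the treatment rate exceeds the control rate
prob_treatment_better = np.mean(post_treatment > post_control)
print(f"Probability that the treatment is better: {prob_treatment_better:.4f}")
```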


6. Power Analysis (How Much Data Do You Need?)
Used to determine sample size before running an experiment.

In [39]:
from statsmodels.stats.power import TTestIndPower

# Define parameters
effect_size = 0.02  # Standardized effect size (Cohen's d), not the raw lift in conversion rate
alpha = 0.05  # Significance level
power = 0.8  # Desired statistical power

# Compute sample size per group
analysis = TTestIndPower()
sample_size = analysis.solve_power(effect_size, power=power, alpha=alpha, ratio=1)

print(f"Required Sample Size per Group: {int(sample_size)}")


Required Sample Size per Group: 39245
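
When the metric is a conversion rate, the raw lift first has to be converted into a standardized effect size. One common choice is Cohen's h (the arcsine transformation). The sketch below uses it with the normal approximation for a two-sided test; the function names and the 10% vs. 12% rates are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def cohens_h(p1, p2):
    """Standardized effect size for the difference between two proportions."""
    return 2 * (np.arcsin(np.sqrt(p2)) - np.arcsin(np.sqrt(p1)))

def sample_size_per_group(p1, p2, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided test (normal approximation)."""
    h = cohens_h(p1, p2)
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    return int(np.ceil(2 * ((z_alpha + z_power) / h) ** 2))

# Detecting a lift from a 10% to a 12% conversion rate
n_per_group = sample_size_per_group(0.10, 0.12)
print(f"Cohen's h: {cohens_h(0.10, 0.12):.4f}")
print(f"Required sample size per group: {n_per_group}")
```

The result is roughly a tenth of the 39,245 above because a two-point lift on a 10% baseline corresponds to a standardized effect of about 0.064, not 0.02.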


## A/B Testing

#### Data Creation

In [7]:
from scipy import stats  # needed for stats.ttest_ind below

# Simulated data for A and B groups
np.random.seed(0)
group_A = np.random.normal(25, 5, 1000)  # Control group
group_B = np.random.normal(28, 5, 1000)  # Treatment group

In [10]:
# Calculate means and standard deviations
mean_A = np.mean(group_A)
mean_B = np.mean(group_B)
std_dev_A = np.std(group_A)
std_dev_B = np.std(group_B)

# Calculate the t-statistic and p-value
t_stat, p_value = stats.ttest_ind(group_A, group_B)

In [11]:
alpha = 0.05 # Define significance level (alpha)

# Check if the p-value is less than alpha (two-tailed test)
if p_value < alpha:
    print("Statistically significant difference between A and B groups")
    if mean_A < mean_B:
        print("Group B performs better.")
    else:
        print("Group A performs better.")
else:
    print("No statistically significant difference between A and B groups")

Statistically significant difference between A and B groups
Group B performs better.
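
The significance check says whether the groups differ, not by how much. A confidence interval for the difference in means gives the size of the effect. The sketch below regenerates similar data with a seeded generator and builds a 95% Welch interval by hand; the variable names and the relative-lift calculation are illustrative additions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_A = rng.normal(25, 5, 1000)  # Control group
group_B = rng.normal(28, 5, 1000)  # Treatment group

# Point estimate and standard error of the difference in means
diff = group_B.mean() - group_A.mean()
var_A = group_A.var(ddof=1) / len(group_A)
var_B = group_B.var(ddof=1) / len(group_B)
se = np.sqrt(var_A + var_B)

# Welch-Satterthwaite degrees of freedom
dof = (var_A + var_B) ** 2 / (
    var_A ** 2 / (len(group_A) - 1) + var_B ** 2 / (len(group_B) - 1)
)

# 95% confidence interval for the difference (B minus A)
t_crit = stats.t.ppf(0.975, dof)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se
relative_lift = diff / group_A.mean()

print(f"Difference: {diff:.3f} (95% CI: {ci_low:.3f} to {ci_high:.3f})")
print(f"Relative lift: {relative_lift:.1%}")
```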


In [12]:
from scipy import stats

def calculate_sample_size(alpha, power, baseline_conversion, min_detectable_effect):
    """
    Calculate the sample size required for an A/B test on conversion rates,
    using the normal approximation for a two-sided two-proportion z-test.
    
    Parameters:
    alpha (float): Significance level (e.g., 0.05 for 5%)
    power (float): Desired power (e.g., 0.80 for 80%)
    baseline_conversion (float): Baseline conversion rate (proportion)
    min_detectable_effect (float): Minimum detectable effect (absolute difference in proportions)
    
    Returns:
    int: Sample size required for each group (control and treatment)
    """
    # Expected conversion rate in the treatment group
    treatment_conversion = baseline_conversion + min_detectable_effect

    # Sum of the per-user variances of the two groups
    variance_sum = (baseline_conversion * (1 - baseline_conversion) +
                    treatment_conversion * (1 - treatment_conversion))
    
    # Calculate the z-scores for alpha (two-sided) and power
    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_power = stats.norm.ppf(power)
    
    # Calculate the sample size for each group
    sample_size = (z_alpha + z_power) ** 2 * variance_sum / min_detectable_effect**2
    
    return int(np.ceil(sample_size))


#### Example usage

In [13]:
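alpha = 0.05
power = 0.80
baseline_conversion = 0.10  # 10% baseline conversion rate
min_detectable_effect = 0.02  # 2 percentage points (absolute) minimum detectable effect

sample_size = calculate_sample_size(alpha, power, baseline_conversion, min_detectable_effect)
print(f"Required sample size per group: {sample_size}")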

[Back to the top](#Contents)
___