# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 04 · Notebook 03 – A/B Testing Framework
**Instructor:** Amir Charkhi  |  **Goal:** Build a complete A/B testing framework for data-driven decisions.

> Format: short theory → quick practice → build understanding → mini-challenges.


---
## Learning Objectives
- Design and implement A/B tests end-to-end
- Build a reusable testing framework
- Handle real-world complications such as sample ratio mismatch, segment differences, and novelty effects
- Create actionable reports

## 1. A/B Test Design: From Question to Experiment
The complete workflow for running experiments.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

In [None]:
# A/B Test Planning Template
test_plan = {
    'test_name': 'Homepage CTA Button Color',
    'hypothesis': 'Green button will increase CTR by 10% (relative)',
    'primary_metric': 'click_through_rate',
    'secondary_metrics': ['bounce_rate', 'time_on_site'],
    'baseline_ctr': 0.05,  # 5%
    'minimum_detectable_effect': 0.005,  # 0.5 percentage points absolute (10% relative to 5%)
    'significance_level': 0.05,
    'power': 0.8,
    'test_duration_days': 14
}

print("A/B TEST PLAN")
print("="*50)
for key, value in test_plan.items():
    print(f"{key.replace('_', ' ').title()}: {value}")

In [None]:
# Calculate required sample size
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import zt_ind_solve_power

p1 = test_plan['baseline_ctr']
p2 = p1 + test_plan['minimum_detectable_effect']

effect_size = proportion_effectsize(p2, p1)  # Cohen's h; (p2, p1) order keeps it positive
n_required = zt_ind_solve_power(
    effect_size=effect_size,
    alpha=test_plan['significance_level'],
    power=test_plan['power'],
    ratio=1
)

n_per_variant = int(np.ceil(n_required))  # round up: a fraction of a user cannot be sampled
print(f"\nRequired sample size per variant: {n_per_variant:,}")
print(f"Total visitors needed: {2*n_per_variant:,}")
print(f"Daily traffic required: {2*n_per_variant/test_plan['test_duration_days']:,.0f}")

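The solver above hides the arithmetic. As a rough cross-check, the textbook normal-approximation formula for two proportions can be written out with only the standard library. This is a sketch, not what `zt_ind_solve_power` computes internally: it uses the unpooled variance of the two rates rather than Cohen's h, so the two numbers should be close but not identical. The function name `sample_size_two_proportions` is an illustrative choice, not part of any library.

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.8):
    """Approximate users needed per variant to detect p1 -> p2 (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # unpooled variance of the difference
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)

# Same inputs as the test plan: 5% baseline, 0.5 point absolute MDE
n_check = sample_size_two_proportions(0.05, 0.055)
print(f"Approximate sample size per variant: {n_check:,}")
```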
**Exercise 1 – Test Planning (easy)**  
Plan an A/B test for email subject lines with a 20% baseline open rate.


In [None]:
# Your turn
# Create test_plan dictionary for email campaign
# Baseline open rate: 20%
# Want to detect: 2% absolute improvement


<details>
<summary><b>Solution</b></summary>

```python
email_test_plan = {
    'test_name': 'Email Subject Line Personalization',
    'hypothesis': 'Personalized subject increases opens by 10% (relative)',
    'primary_metric': 'open_rate',
    'secondary_metrics': ['click_rate', 'unsubscribe_rate'],
    'baseline_rate': 0.20,  # 20%
    'minimum_detectable_effect': 0.02,  # 2% absolute
    'significance_level': 0.05,
    'power': 0.8
}

# Calculate sample size
p1 = email_test_plan['baseline_rate']
p2 = p1 + email_test_plan['minimum_detectable_effect']

effect = proportion_effectsize(p2, p1)  # Cohen's h, positive when p2 > p1
n = zt_ind_solve_power(effect, alpha=0.05, power=0.8, ratio=1)

print(f"Emails needed per variant: {n:.0f}")
print(f"Total emails: {2*n:.0f}")
print(f"\nIf you send 1000 emails/day:")
print(f"Test duration: {2*n/1000:.1f} days")
```
</details>

## 2. Simulating A/B Test Data
Creating realistic test data for our framework.

In [None]:
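def generate_ab_test_data(n_users=10000, effect_size=0.1, seed=42):
    """
    Generate simulated A/B test data.

    effect_size is the relative lift applied to the 5% control conversion rate
    (0.1 means the treatment converts at 5.5%).
    """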
    np.random.seed(seed)
    
    # Create user data
    data = []
    start_date = datetime(2024, 1, 1)
    
    for i in range(n_users):
        # Random assignment to control/treatment
        variant = np.random.choice(['control', 'treatment'])
        
        # Base conversion probability
        if variant == 'control':
            p_convert = 0.05
        else:
            p_convert = 0.05 * (1 + effect_size)  # relative lift (10% when effect_size=0.1)
        
        # Generate user behavior
        converted = np.random.random() < p_convert
        
        # Add some realistic variation
        timestamp = start_date + timedelta(
            days=np.random.randint(0, 14),
            hours=np.random.randint(0, 24),
            minutes=np.random.randint(0, 60)
        )
        
        # Time on site (correlated with conversion)
        if converted:
            time_on_site = np.random.gamma(5, 2) * 60  # seconds
        else:
            time_on_site = np.random.gamma(3, 2) * 60
        
        data.append({
            'user_id': f'user_{i:05d}',
            'timestamp': timestamp,
            'variant': variant,
            'converted': converted,
            'time_on_site': time_on_site,
            'device': np.random.choice(['mobile', 'desktop', 'tablet'], 
                                     p=[0.5, 0.4, 0.1])
        })
    
    return pd.DataFrame(data)

# Generate test data
ab_data = generate_ab_test_data(n_users=10000, effect_size=0.1)
print(ab_data.head())
print(f"\nData shape: {ab_data.shape}")

In [None]:
# Quick data validation
print("Data Summary:")
print("="*50)
print(f"Total users: {len(ab_data)}")
print(f"\nVariant split:")
print(ab_data['variant'].value_counts())
print(f"\nConversion rates:")
print(ab_data.groupby('variant')['converted'].agg(['sum', 'mean']))

**Exercise 2 – Data Validation (medium)**  
Check for sample ratio mismatch and data quality issues.


In [None]:
# Your turn
# Check if the 50/50 split is statistically valid
# Check for missing data
# Check for duplicate users


<details>
<summary><b>Solution</b></summary>

```python
# Sample Ratio Mismatch (SRM) check
control_count = (ab_data['variant'] == 'control').sum()
treatment_count = (ab_data['variant'] == 'treatment').sum()
total = len(ab_data)

# Chi-square test for 50/50 split
expected = total / 2
chi2 = ((control_count - expected)**2 + (treatment_count - expected)**2) / expected
p_value = stats.chi2.sf(chi2, df=1)  # survival function (1 - cdf), more precise for tiny p-values

print("Sample Ratio Check:")
print(f"Control: {control_count} ({control_count/total:.1%})")
print(f"Treatment: {treatment_count} ({treatment_count/total:.1%})")
print(f"Chi-square p-value: {p_value:.4f}")

if p_value < 0.01:
    print("⚠️ WARNING: Sample ratio mismatch detected!")
else:
    print("✅ Sample ratio looks good")

# Data quality checks
print("\nData Quality:")
print(f"Missing values: {ab_data.isnull().sum().sum()}")
print(f"Duplicate users: {ab_data['user_id'].duplicated().sum()}")
print(f"Date range: {ab_data['timestamp'].min()} to {ab_data['timestamp'].max()}")
```
</details>

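The solution above assumes a planned 50/50 split. Many experiments ramp up with an uneven allocation (for example 90/10), and the same chi-square check applies once the expected counts reflect the planned shares. The sketch below uses only the standard library; `srm_check` and its parameters are hypothetical names, and the counts are made up for illustration.

```python
import math

def srm_check(n_control, n_treatment, expected_control_share=0.5, threshold=0.01):
    """Chi-square goodness-of-fit test (df=1) against the planned traffic split."""
    total = n_control + n_treatment
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For one degree of freedom the chi-square survival function is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return {'chi2': chi2, 'p_value': p_value, 'mismatch': p_value < threshold}

print(srm_check(5000, 5000))                             # planned 50/50 split
print(srm_check(9200, 800, expected_control_share=0.9))  # planned 90/10 split
```

The 0.01 threshold mirrors the one used in the solution; a stricter value is a reasonable choice when the check runs automatically on many experiments.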
## 3. Building the A/B Testing Framework

In [None]:
class ABTestAnalyzer:
    """
    Complete A/B Testing Framework
    """
    def __init__(self, data, variant_col='variant', 
                 control_name='control', treatment_name='treatment'):
        self.data = data
        self.variant_col = variant_col
        self.control_name = control_name
        self.treatment_name = treatment_name
        
    def calculate_conversion_rate(self, metric='converted'):
        """Calculate conversion rates by variant"""
        results = self.data.groupby(self.variant_col)[metric].agg([
            'sum', 'count', 'mean'
        ])
        results.columns = ['conversions', 'users', 'conversion_rate']
        return results
    
    def run_significance_test(self, metric='converted', alpha=0.05):
        """Run statistical significance test"""
        control = self.data[self.data[self.variant_col] == self.control_name][metric]
        treatment = self.data[self.data[self.variant_col] == self.treatment_name][metric]
        
        # Two-proportion z-test
        n_control = len(control)
        n_treatment = len(treatment)
        p_control = control.mean()
        p_treatment = treatment.mean()
        
        # Pooled proportion
        p_pooled = (control.sum() + treatment.sum()) / (n_control + n_treatment)
        
        # Standard error
        se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_control + 1/n_treatment))
        
        # Z-score
        z_score = (p_treatment - p_control) / se
        
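        # P-value (two-tailed); sf() is more precise than 1 - cdf() in the tails
        p_value = 2 * stats.norm.sf(abs(z_score))
        
        # Confidence interval for the difference (unpooled SE, level set by alpha)
        ci_se = np.sqrt(p_control*(1-p_control)/n_control + 
                       p_treatment*(1-p_treatment)/n_treatment)
        ci_margin = stats.norm.ppf(1 - alpha / 2) * ci_se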
        
        lift = (p_treatment - p_control) / p_control * 100
        
        return {
            'control_rate': p_control,
            'treatment_rate': p_treatment,
            'absolute_difference': p_treatment - p_control,
            'relative_lift': lift,
            'z_score': z_score,
            'p_value': p_value,
            'significant': p_value < alpha,
            'ci_lower': (p_treatment - p_control) - ci_margin,
            'ci_upper': (p_treatment - p_control) + ci_margin
        }
    
    def plot_results(self, metric='converted'):
        """Visualize A/B test results"""
        fig, axes = plt.subplots(1, 3, figsize=(15, 5))
        
        # Conversion rates
        rates = self.calculate_conversion_rate(metric)
        axes[0].bar(rates.index, rates['conversion_rate'], 
                   color=['blue', 'green'])
        axes[0].set_ylabel('Conversion Rate')
        axes[0].set_title('Conversion Rates by Variant')
        axes[0].set_ylim(0, max(rates['conversion_rate']) * 1.2)
        
        # Add values on bars
        for i, (idx, row) in enumerate(rates.iterrows()):
            axes[0].text(i, row['conversion_rate'], 
                        f"{row['conversion_rate']:.3f}", 
                        ha='center', va='bottom')
        
        # Cumulative conversion rate over time
        # (running conversions divided by running users, per variant)
        daily = self.data.groupby(
            [pd.Grouper(key='timestamp', freq='D'), self.variant_col]
        )[metric].agg(['sum', 'count']).unstack(fill_value=0)
        
        cumulative_rate = daily['sum'].cumsum() / daily['count'].cumsum()
        cumulative_rate.plot(ax=axes[1])
        axes[1].set_xlabel('Date')
        axes[1].set_ylabel('Cumulative Conversion Rate')
        axes[1].set_title('Cumulative Conversion Rate Over Time')
        axes[1].legend()
        
        # Confidence intervals
        test_results = self.run_significance_test(metric)
        diff = test_results['absolute_difference']
        ci_lower = test_results['ci_lower']
        ci_upper = test_results['ci_upper']
        
        axes[2].errorbar(1, diff, 
                        yerr=[[diff-ci_lower], [ci_upper-diff]], 
                        fmt='o', markersize=10, capsize=10)
        axes[2].axhline(y=0, color='gray', linestyle='--')
        axes[2].set_xlim(0, 2)
        axes[2].set_ylabel('Difference in Conversion Rate')
        axes[2].set_title('95% Confidence Interval')
        axes[2].set_xticks([])
        
        plt.tight_layout()
        plt.show()

# Use the framework
analyzer = ABTestAnalyzer(ab_data)
results = analyzer.run_significance_test()

print("A/B TEST RESULTS")
print("="*50)
for key, value in results.items():
    if isinstance(value, float):
        if 'rate' in key or 'difference' in key:
            print(f"{key}: {value:.4f}")
        else:
            print(f"{key}: {value:.3f}")
    else:
        print(f"{key}: {value}")

In [None]:
# Visualize results
analyzer.plot_results()

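`run_significance_test` needs row-level data. When only aggregate counts are available (for example from a dashboard export), the same two-proportion z-test can be run on four numbers. The sketch below follows the logic of the method, with a pooled standard error for the test and an unpooled one for the interval, using only the standard library. The function name and the example counts are made up for illustration and are not results from this notebook.

```python
import math
from statistics import NormalDist

def z_test_from_counts(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Two-proportion z-test and CI for the difference, from summary counts."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    # Pooled standard error for the hypothesis test (no difference under H0)
    p_pooled = (conv_c + conv_t) / (n_c + n_t)
    se_test = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_c + 1 / n_t))
    z = diff / se_test
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval
    se_ci = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se_ci
    return {'z_score': z, 'p_value': p_value,
            'ci_lower': diff - margin, 'ci_upper': diff + margin}

# Made-up counts: 250/5000 conversions in control, 300/5000 in treatment
example = z_test_from_counts(250, 5000, 300, 5000)
print({k: round(v, 4) for k, v in example.items()})
```

The hypothesis test pools the two groups because the null hypothesis says they share one rate; the interval describes the observed difference, so each group keeps its own variance.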
## 4. Advanced Analysis: Segmentation

In [None]:
def segment_analysis(data, segment_col='device'):
    """
    Analyze A/B test results by segment
    """
    segments = data[segment_col].unique()
    results = []
    
    for segment in segments:
        segment_data = data[data[segment_col] == segment]
        
        control = segment_data[segment_data['variant'] == 'control']['converted']
        treatment = segment_data[segment_data['variant'] == 'treatment']['converted']
        
        if len(control) > 0 and len(treatment) > 0:
            _, p_value = stats.chi2_contingency([
                [control.sum(), len(control) - control.sum()],
                [treatment.sum(), len(treatment) - treatment.sum()]
            ])[:2]
            
            results.append({
                'segment': segment,
                'control_rate': control.mean(),
                'treatment_rate': treatment.mean(),
                'lift': (treatment.mean() - control.mean()) / control.mean() * 100,
                'p_value': p_value,
                'sample_size': len(segment_data)
            })
    
    return pd.DataFrame(results).sort_values('lift', ascending=False)

# Run segment analysis
segment_results = segment_analysis(ab_data, 'device')
print("Segment Analysis - By Device:")
print(segment_results.to_string(index=False))

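Testing each segment separately means running several significance tests at once, which raises the chance that at least one looks significant by luck. One common guard is the Holm-Bonferroni adjustment. The sketch below is a plain-Python version; `holm_adjust` is a hypothetical helper and the p-values are placeholders, not outputs of this notebook. In the notebook it could be applied as `holm_adjust(segment_results['p_value'].tolist())`.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted

# Placeholder p-values for three segments (illustrative only)
raw = {'mobile': 0.01, 'desktop': 0.04, 'tablet': 0.03}
adjusted = dict(zip(raw, holm_adjust(list(raw.values()))))
print(adjusted)
```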
In [None]:
# Visualize segment results
fig, ax = plt.subplots(figsize=(10, 6))

segments = segment_results['segment']
x = np.arange(len(segments))
width = 0.35

control_rates = segment_results['control_rate']
treatment_rates = segment_results['treatment_rate']

ax.bar(x - width/2, control_rates, width, label='Control', color='blue', alpha=0.7)
ax.bar(x + width/2, treatment_rates, width, label='Treatment', color='green', alpha=0.7)

ax.set_xlabel('Device Type')
ax.set_ylabel('Conversion Rate')
ax.set_title('A/B Test Results by Device Segment')
ax.set_xticks(x)
ax.set_xticklabels(segments)
ax.legend()

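# Add significance stars
# segment_results is sorted by lift, so use the row position (not the index label) for x
for pos, (_, row) in enumerate(segment_results.iterrows()):
    if row['p_value'] < 0.05:
        ax.text(pos, max(row['control_rate'], row['treatment_rate']) + 0.002, 
               '*', ha='center', fontsize=20, color='red')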

plt.tight_layout()
plt.show()

**Exercise 3 – Time-based Analysis (medium)**  
Check if the treatment effect changes over time (novelty effect).


In [None]:
# Your turn
# Analyze conversion rates by week
# Check if treatment effect diminishes over time


<details>
<summary><b>Solution</b></summary>

```python
# Add week number
ab_data['week'] = ((ab_data['timestamp'] - ab_data['timestamp'].min()).dt.days // 7) + 1

# Calculate weekly conversion rates
weekly_results = []
for week in ab_data['week'].unique():
    week_data = ab_data[ab_data['week'] == week]
    
    control = week_data[week_data['variant'] == 'control']['converted']
    treatment = week_data[week_data['variant'] == 'treatment']['converted']
    
    if len(control) > 30 and len(treatment) > 30:  # Minimum sample
        weekly_results.append({
            'week': week,
            'control_rate': control.mean(),
            'treatment_rate': treatment.mean(),
            'lift': (treatment.mean() - control.mean()) / control.mean() * 100
        })

weekly_df = pd.DataFrame(weekly_results)

# Plot novelty effect
plt.figure(figsize=(10, 6))
plt.plot(weekly_df['week'], weekly_df['control_rate'], 
         'o-', label='Control', color='blue')
plt.plot(weekly_df['week'], weekly_df['treatment_rate'], 
         'o-', label='Treatment', color='green')
plt.xlabel('Week')
plt.ylabel('Conversion Rate')
plt.title('Checking for Novelty Effect')
plt.legend()
plt.grid(True, alpha=0.3)

# Add trend line for lift
ax2 = plt.gca().twinx()
ax2.plot(weekly_df['week'], weekly_df['lift'], 
         's--', color='red', alpha=0.5, label='Lift %')
ax2.set_ylabel('Lift (%)', color='red')
ax2.tick_params(axis='y', labelcolor='red')

plt.show()

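# Check if lift is decreasing
# With only two weekly points the correlation is always +1 or -1, so require at least three
if len(weekly_df) > 2:
    correlation = weekly_df['week'].corr(weekly_df['lift'])
    print(f"Week-Lift Correlation: {correlation:.3f}")
    if correlation < -0.5:
        print("⚠️ Warning: Possible novelty effect (lift decreasing over time)")
    else:
        print("✅ No strong novelty effect detected")
else:
    print("Not enough weeks to estimate a trend in lift (need at least 3)")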
```
</details>

## 5. Reporting Framework

In [None]:
def generate_ab_test_report(analyzer, data, test_name="A/B Test"):
    """
    Generate comprehensive A/B test report
    """
    results = analyzer.run_significance_test()
    rates = analyzer.calculate_conversion_rate()
    
    report = f"""
    {'='*60}
    A/B TEST REPORT: {test_name}
    {'='*60}
    
    EXECUTIVE SUMMARY
    -----------------
    Test Duration: {(data['timestamp'].max() - data['timestamp'].min()).days} days
    Total Users: {len(data):,}
    
    RESULTS
    -------
    Control Conversion Rate: {results['control_rate']:.2%}
    Treatment Conversion Rate: {results['treatment_rate']:.2%}
    Relative Lift: {results['relative_lift']:+.1f}%
    
    Statistical Significance: {'✅ YES' if results['significant'] else '❌ NO'}
    P-value: {results['p_value']:.4f}
    95% CI for difference: [{results['ci_lower']:.4f}, {results['ci_upper']:.4f}]
    
    RECOMMENDATION
    --------------
    """
    
    if results['significant']:
        if results['relative_lift'] > 0:
            report += """✅ SHIP IT! The treatment shows a statistically significant improvement.
    Roll out the treatment variant to all users."""
        else:
            report += """⚠️ DO NOT SHIP! The treatment shows a statistically significant decrease.
    Keep the control variant."""
    else:
        report += """🔄 INCONCLUSIVE. No statistically significant difference detected.
    Consider:
    - Running the test longer for more data
    - Testing a more impactful change
    - Checking segment-level results"""
    
    report += f"""
    
    BUSINESS IMPACT
    ---------------
    If implemented for 100,000 users:
    - Additional conversions: {int(100000 * results['absolute_difference']):,}
    - At $50 per conversion: ${int(100000 * results['absolute_difference'] * 50):,} additional revenue
    
    {'='*60}
    """
    
    return report

# Generate report
report = generate_ab_test_report(analyzer, ab_data, "Homepage CTA Color Test")
print(report)

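The business impact section of the report scales the point estimate only. Stakeholders usually benefit from seeing the range implied by the confidence interval as well, since a significant result can still have a wide plausible range. The sketch below scales the difference and both CI bounds to a population; `projected_impact`, the 100,000 users, and the $50 per conversion mirror the report's assumptions, and the input numbers are hypothetical.

```python
def projected_impact(diff, ci_lower, ci_upper, n_users=100_000, value_per_conversion=50):
    """Scale the estimated difference and its CI bounds to a user population."""
    def scale(rate_difference):
        conversions = rate_difference * n_users
        return round(conversions), round(conversions * value_per_conversion)

    return {'expected': scale(diff), 'low': scale(ci_lower), 'high': scale(ci_upper)}

# Hypothetical inputs: +0.5 point difference with a CI that spans zero
impact = projected_impact(diff=0.005, ci_lower=-0.004, ci_upper=0.014)
for label, (conversions, revenue) in impact.items():
    print(f"{label:>8}: {conversions:+,} conversions, {revenue:+,} dollars")
```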
**Exercise 4 – Complete A/B Test (hard)**  
Run a complete A/B test analysis with your own hypothesis.


In [None]:
# Your turn
# 1. Generate data for a pricing test (old: $9.99, new: $7.99)
# 2. Run significance test
# 3. Check segments
# 4. Generate report with revenue impact


<details>
<summary><b>Solution</b></summary>

```python
# Pricing test scenario
def generate_pricing_test_data(n_users=5000):
    np.random.seed(42)
    data = []
    
    for i in range(n_users):
        variant = np.random.choice(['control_$9.99', 'treatment_$7.99'])
        
        # Lower price = higher conversion
        if variant == 'control_$9.99':
            p_convert = 0.08
            revenue_per_user = 9.99
        else:
            p_convert = 0.12  # 50% lift in conversion
            revenue_per_user = 7.99
        
        converted = np.random.random() < p_convert
        revenue = revenue_per_user if converted else 0
        
        data.append({
            'user_id': f'user_{i:05d}',
            'variant': variant,
            'converted': converted,
            'revenue': revenue,
            'user_segment': np.random.choice(['new', 'returning'], p=[0.3, 0.7])
        })
    
    return pd.DataFrame(data)

# Generate and analyze
pricing_data = generate_pricing_test_data()

# Conversion analysis
print("PRICING TEST RESULTS")
print("="*50)
conversion_by_variant = pricing_data.groupby('variant').agg({
    'converted': ['sum', 'mean'],
    'revenue': 'mean'
})
print(conversion_by_variant)

# Revenue comparison
control_revenue = pricing_data[pricing_data['variant'] == 'control_$9.99']['revenue'].mean()
treatment_revenue = pricing_data[pricing_data['variant'] == 'treatment_$7.99']['revenue'].mean()

print(f"\nAverage Revenue per User:")
print(f"Control ($9.99): ${control_revenue:.2f}")
print(f"Treatment ($7.99): ${treatment_revenue:.2f}")
print(f"Revenue Lift: {(treatment_revenue - control_revenue) / control_revenue * 100:+.1f}%")

# Statistical test on revenue
# Welch's t-test (equal_var=False): the two groups have different revenue variances
control_rev = pricing_data[pricing_data['variant'] == 'control_$9.99']['revenue']
treatment_rev = pricing_data[pricing_data['variant'] == 'treatment_$7.99']['revenue']
t_stat, p_value = stats.ttest_ind(control_rev, treatment_rev, equal_var=False)

print(f"\nRevenue Significance Test:")
print(f"p-value: {p_value:.4f}")

# Segment analysis
print("\nSegment Analysis:")
for segment in ['new', 'returning']:
    seg_data = pricing_data[pricing_data['user_segment'] == segment]
    seg_control = seg_data[seg_data['variant'] == 'control_$9.99']['converted'].mean()
    seg_treatment = seg_data[seg_data['variant'] == 'treatment_$7.99']['converted'].mean()
    print(f"{segment.capitalize()} users: {seg_control:.1%} → {seg_treatment:.1%} "
          f"({(seg_treatment-seg_control)/seg_control*100:+.0f}%)")

# Business recommendation
print("\n" + "="*50)
print("RECOMMENDATION:")
if p_value >= 0.05:
    print("🔄 Inconclusive - need more data")
elif treatment_revenue > control_revenue:
    print("✅ Lower price to $7.99 - higher revenue despite lower price point!")
    # Assumes 1,000 users per day (illustrative traffic figure)
    print(f"Projected annual impact: ${(treatment_revenue - control_revenue) * 365 * 1000:,.0f}")
else:
    print("❌ Keep $9.99 price - lower price reduces total revenue")
```
</details>

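Revenue per user is mostly zeros with occasional purchases, so the normality assumption behind a t-test holds only approximately. A percentile bootstrap is one common alternative that makes no distributional assumption. The sketch below uses only the standard library and simulates its own small dataset with the same conversion rates and prices as the solution; all names are illustrative, and the resample count is kept low so it runs quickly.

```python
import random

def bootstrap_mean_diff_ci(control, treatment, n_resamples=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(treatment) - mean(control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping group sizes fixed
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(sum(t) / len(t) - sum(c) / len(c))
    diffs.sort()
    lower = diffs[int(n_resamples * alpha / 2)]
    upper = diffs[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper

# Simulated revenue with the solution's assumptions: 8% convert at $9.99, 12% at $7.99
rng = random.Random(42)
control_rev = [9.99 if rng.random() < 0.08 else 0.0 for _ in range(1000)]
treatment_rev = [7.99 if rng.random() < 0.12 else 0.0 for _ in range(1000)]

ci_low, ci_high = bootstrap_mean_diff_ci(control_rev, treatment_rev)
print(f"95% bootstrap CI for revenue-per-user difference: [{ci_low:.3f}, {ci_high:.3f}]")
```

If the interval includes zero, the revenue difference is not clearly distinguishable from noise at this sample size, even when the conversion rates differ.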
## 6. Mini-Challenges
- **M1 (easy):** Calculate minimum detectable effect for your traffic
- **M2 (medium):** Implement sequential testing (early stopping)
- **M3 (hard):** Build Bayesian A/B testing framework

In [None]:
# Your turn - try the challenges!


<details>
<summary><b>Solutions</b></summary>

```python
# M1 - Minimum Detectable Effect
from statsmodels.stats.power import zt_ind_solve_power

daily_traffic = 1000
test_days = 14
n_per_group = (daily_traffic * test_days) / 2

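# What effect can we detect?
# Note: the solver returns a standardized effect size (Cohen's h when comparing
# proportions), not a difference in conversion rates
mde = zt_ind_solve_power(effect_size=None, 
                         nobs1=n_per_group,
                         alpha=0.05, 
                         power=0.8)
print(f"With {n_per_group:.0f} users per group:")
print(f"Minimum detectable effect size (Cohen's h): {mde:.3f}")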

# M2 - Sequential Testing (simplified)
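def sequential_test(data, alpha=0.05, check_points=5):
    # Interim looks must follow arrival order, so sort by time first
    data = data.sort_values('timestamp')
    n = len(data)
    check_every = n // check_points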
    
    for i in range(1, check_points + 1):
        subset = data.iloc[:i*check_every]
        control = subset[subset['variant'] == 'control']['converted']
        treatment = subset[subset['variant'] == 'treatment']['converted']
        
        if len(control) > 30 and len(treatment) > 30:
            _, p_value = stats.chi2_contingency([
                [control.sum(), len(control) - control.sum()],
                [treatment.sum(), len(treatment) - treatment.sum()]
            ])[:2]
            
            # Bonferroni correction for multiple checks
            adjusted_alpha = alpha / check_points
            
            print(f"Check {i}: n={len(subset)}, p={p_value:.4f}")
            if p_value < adjusted_alpha:
                print(f"✅ Stop early! Significant at check {i}")
                return True
    return False

sequential_test(ab_data)

# M3 - Bayesian A/B Testing
from scipy.stats import beta

def bayesian_ab_test(data, prior_alpha=1, prior_beta=1):
    control = data[data['variant'] == 'control']['converted']
    treatment = data[data['variant'] == 'treatment']['converted']
    
    # Update priors with data
    control_alpha = prior_alpha + control.sum()
    control_beta = prior_beta + len(control) - control.sum()
    
    treatment_alpha = prior_alpha + treatment.sum()
    treatment_beta = prior_beta + len(treatment) - treatment.sum()
    
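    # Sample from posteriors (one seeded generator keeps the draws reproducible and independent)
    n_samples = 10000
    rng = np.random.default_rng(42)
    control_samples = beta.rvs(control_alpha, control_beta, size=n_samples, random_state=rng)
    treatment_samples = beta.rvs(treatment_alpha, treatment_beta, size=n_samples, random_state=rng)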
    
    # Probability treatment is better
    prob_treatment_better = (treatment_samples > control_samples).mean()
    
    print(f"Bayesian A/B Test Results:")
    print(f"P(Treatment > Control): {prob_treatment_better:.1%}")
    print(f"Expected Control Rate: {control_samples.mean():.3f}")
    print(f"Expected Treatment Rate: {treatment_samples.mean():.3f}")
    
    if prob_treatment_better > 0.95:
        print("✅ Strong evidence treatment is better")
    elif prob_treatment_better < 0.05:
        print("❌ Strong evidence control is better")
    else:
        print("🔄 Need more data")

bayesian_ab_test(ab_data)
```
</details>

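The solver in the M1 solution returns a standardized effect size. When the metric is a proportion and the effect size is built with `proportion_effectsize` as in Section 1, that number corresponds to Cohen's h, which is hard to read as a business quantity. The sketch below inverts the arcsine transform to turn h back into a conversion rate for a given baseline. The helper names, the 5% baseline, and the value of `h` are assumptions chosen for illustration.

```python
import math

def cohens_h(p1, p2):
    """Cohen's h for moving from rate p1 to rate p2 (positive when p2 > p1)."""
    return 2 * (math.asin(math.sqrt(p2)) - math.asin(math.sqrt(p1)))

def rate_from_h(baseline, h):
    """Invert Cohen's h: the rate that sits h above the given baseline.

    Only valid while asin(sqrt(baseline)) + h / 2 stays within [0, pi / 2].
    """
    return math.sin(math.asin(math.sqrt(baseline)) + h / 2) ** 2

baseline = 0.05  # assumed baseline conversion rate
h = 0.047        # hypothetical solver output, for illustration only
detectable_rate = rate_from_h(baseline, h)
print(f"Baseline {baseline:.1%} -> detectable rate {detectable_rate:.2%} "
      f"(absolute MDE {detectable_rate - baseline:.2%})")
```

Because the transform is nonlinear, the same h maps to a different absolute difference at a different baseline, so the conversion should always use the rate the test actually starts from.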
## Wrap-Up & Next Steps
✅ You can design and plan A/B tests properly  
✅ You built a complete testing framework  
✅ You can analyze results and check for biases  
✅ You can create actionable reports for stakeholders  

**Week 4 Complete!** You now have the statistical foundation and practical framework for data-driven decision making through A/B testing.

**Next Week:** Causal Inference - Going beyond correlation to understand true cause and effect!
