# E-commerce A/B Test Analysis

This notebook analyzes the e-commerce A/B test dataset from Kaggle using our custom A/B Test Analysis Framework.

**Dataset**: https://www.kaggle.com/datasets/zhangluyuan/ab-testing

**Scenario**: An e-commerce company tested a new website design against the old design to see if it would improve conversion rates.

**Question**: Should we roll out the new design to all users?

## Setup and Data Loading

In [0]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import our A/B test framework
import sys
sys.path.append('..')  # Add parent directory to path

from ab_test_framework import (
    PowerAnalyzer,
    SignificanceTest,
    EffectSizeCalculator,
    MultipleTestingCorrection,
    ResultVisualizer
)

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

print("✓ Libraries imported successfully")

In [0]:
# Load the dataset
# Download from: https://www.kaggle.com/datasets/zhangluyuan/ab-testing
# Place the 'ab_data.csv' file in the same directory as this notebook

df = pd.read_csv('ab_data.csv')

print(f"Dataset loaded: {len(df):,} rows")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst few rows:")
df.head()

## 1. Exploratory Data Analysis

In [0]:
# Basic statistics
print("Dataset Shape:", df.shape)
print("\nData Types:")
print(df.dtypes)
print("\nMissing Values:")
print(df.isnull().sum())
print("\nBasic Statistics:")
df.describe()

In [0]:
# Check the distribution of groups
print("Group Distribution:")
print(df['group'].value_counts())
print("\nLanding Page Distribution:")
print(df['landing_page'].value_counts())
print("\nConversion Distribution:")
print(df['converted'].value_counts())
print(f"\nOverall Conversion Rate: {df['converted'].mean():.2%}")

In [0]:
# Check for data quality issues
print("Checking for mismatched group/landing_page combinations...\n")

# Control should have old_page, treatment should have new_page
mismatches = df[
    ((df['group'] == 'control') & (df['landing_page'] == 'new_page')) |
    ((df['group'] == 'treatment') & (df['landing_page'] == 'old_page'))
]

print(f"Found {len(mismatches):,} mismatched rows ({len(mismatches)/len(df)*100:.2f}%)")

if len(mismatches) > 0:
    print("\nRemoving mismatched rows for clean analysis...")
    df_clean = df[
        ((df['group'] == 'control') & (df['landing_page'] == 'old_page')) |
        ((df['group'] == 'treatment') & (df['landing_page'] == 'new_page'))
    ].copy()
    print(f"Clean dataset: {len(df_clean):,} rows")
else:
    df_clean = df.copy()
    print("No mismatches found - data looks good!")

In [0]:
# Check for duplicate users
duplicate_users = df_clean['user_id'].duplicated().sum()
print(f"Duplicate users: {duplicate_users:,}")

if duplicate_users > 0:
    print("\nKeeping first occurrence of each user...")
    df_clean = df_clean.drop_duplicates(subset='user_id', keep='first')
    print(f"Final dataset: {len(df_clean):,} unique users")

In [0]:
# Visualize the data
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Group sizes
group_counts = df_clean['group'].value_counts()
axes[0].bar(group_counts.index, group_counts.values, color=['#2E86AB', '#A23B72'])
axes[0].set_title('Sample Size by Group', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Number of Users')
for i, v in enumerate(group_counts.values):
    axes[0].text(i, v, f'{v:,}', ha='center', va='bottom', fontweight='bold')

# Conversion rates by group
conv_rates = df_clean.groupby('group')['converted'].mean()
bars = axes[1].bar(conv_rates.index, conv_rates.values, color=['#2E86AB', '#A23B72'])
axes[1].set_title('Conversion Rate by Group', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Conversion Rate')
axes[1].set_ylim(0, max(conv_rates.values) * 1.2)
axes[1].yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: f'{y:.1%}'))
for i, v in enumerate(conv_rates.values):
    axes[1].text(i, v, f'{v:.2%}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

print("\nConversion Rates:")
for group, rate in conv_rates.items():
    print(f"  {group}: {rate:.2%}")

## 2. Prepare Data for Analysis

In [0]:
# Split data by group
control = df_clean[df_clean['group'] == 'control']
treatment = df_clean[df_clean['group'] == 'treatment']

# Extract key metrics
conversions_control = control['converted'].sum()
n_control = len(control)
rate_control = conversions_control / n_control

conversions_treatment = treatment['converted'].sum()
n_treatment = len(treatment)
rate_treatment = conversions_treatment / n_treatment

print("="*60)
print("EXPERIMENT SUMMARY")
print("="*60)
print(f"\nControl (Old Design):")
print(f"  Users: {n_control:,}")
print(f"  Conversions: {conversions_control:,}")
print(f"  Conversion Rate: {rate_control:.2%}")

print(f"\nTreatment (New Design):")
print(f"  Users: {n_treatment:,}")
print(f"  Conversions: {conversions_treatment:,}")
print(f"  Conversion Rate: {rate_treatment:.2%}")

print(f"\nObserved Difference:")
abs_diff = rate_treatment - rate_control
rel_diff = (rate_treatment - rate_control) / rate_control if rate_control > 0 else 0
print(f"  Absolute: {abs_diff:+.2%} ({abs_diff*100:+.2f} percentage points)")
print(f"  Relative: {rel_diff:+.1%}")

## 3. Statistical Power Analysis

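Before testing for significance, let's check whether this experiment had sufficient statistical power. Note that the calculation below plugs in the observed conversion rates, so it measures power to detect the observed difference rather than a pre-specified minimum effect.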

In [0]:
# Check the power of the experiment
analyzer = PowerAnalyzer(alpha=0.05)

power_result = analyzer.calculate_power(
    n_control=n_control,
    n_treatment=n_treatment,
    baseline_rate=rate_control,
    treatment_rate=rate_treatment
)

print("="*60)
print("STATISTICAL POWER ANALYSIS")
print("="*60)
print(f"\nActual Power: {power_result['power']:.1%}")
print(f"Effect Size: {power_result['effect_size']:.4f} (absolute)")
print(f"Effect Size: {power_result['effect_size_relative']:.2%} (relative)")
print(f"\nInterpretation: {power_result['interpretation']}")

if power_result['power'] < 0.8:
    print("\n⚠️ WARNING: Power is below the standard 80% threshold.")
    print("   This experiment may not be sensitive enough to detect the effect.")
else:
    print("\n✓ Good! The experiment has sufficient power (≥80%).")
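The internals of `PowerAnalyzer.calculate_power` are not shown in this notebook, so the following is only a minimal sketch of how power for a two-sided two-proportion z-test is commonly approximated, using the normal approximation and the standard library. The function name `two_proportion_power` and the example numbers are hypothetical. One caveat worth keeping in mind: when the inputs are the observed rates, the resulting "observed power" is closely tied to the p-value of the same test, so it adds little independent information.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_power(p1, p2, n1, n2, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)

    # Standard error under the null hypothesis uses the pooled rate
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    se_null = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

    # Standard error under the alternative uses each group's own rate
    se_alt = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

    diff = abs(p2 - p1)
    # Probability that the statistic falls in either rejection region
    upper = 1 - z.cdf((z_crit * se_null - diff) / se_alt)
    lower = z.cdf((-z_crit * se_null - diff) / se_alt)
    return upper + lower


# Illustrative numbers only (not taken from the dataset)
print(f"Power: {two_proportion_power(0.10, 0.12, 4000, 4000):.1%}")
```

With identical rates the function returns `alpha`, which is a quick sanity check that the rejection regions are set up correctly.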

In [0]:
# What sample size would we have needed?
# Let's check for a 2% absolute lift (a common target)
sample_size = analyzer.calculate_sample_size(
    baseline_rate=rate_control,
    minimum_detectable_effect=0.02,  # 2 percentage points
    power=0.8,
    ratio=1.0
)

print("\n" + "="*60)
print("SAMPLE SIZE FOR DETECTING 2% ABSOLUTE LIFT")
print("="*60)
print(f"\nRequired per group: {sample_size['n_control']:,}")
print(f"Total required: {sample_size['total_sample_size']:,}")
print(f"\nActual sample:")
print(f"  Per group: ~{(n_control + n_treatment)//2:,}")
print(f"  Total: {n_control + n_treatment:,}")

if n_control >= sample_size['n_control']:
    print("\n✓ We have sufficient sample size for detecting a 2% lift!")
else:
    print(f"\n⚠️ We need {sample_size['n_control'] - n_control:,} more users per group.")
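For reference, the per-group sample size for equal-sized groups can be approximated with the textbook normal-approximation formula. This is a sketch under that assumption, not necessarily what `calculate_sample_size` does internally; `required_sample_size` is a hypothetical helper and the 12% baseline is an illustrative value.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_sample_size(baseline_rate, mde, alpha=0.05, power=0.8):
    """Per-group n for a two-sided two-proportion z-test with equal groups."""
    p1 = baseline_rate
    p2 = baseline_rate + mde  # mde is an absolute lift (e.g. 0.02 = 2 points)
    p_bar = (p1 + p2) / 2

    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)

    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / mde ** 2)


# Illustrative baseline of 12% and a 2-point absolute lift
n_per_group = required_sample_size(0.12, 0.02)
print(f"Required per group: {n_per_group:,}")
```

Because the required n scales with 1 / mde², halving the minimum detectable effect roughly quadruples the sample size.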

## 4. Statistical Significance Testing

Now let's test whether the difference in conversion rates is statistically significant.

In [0]:
# Perform two-proportion z-test
tester = SignificanceTest(alpha=0.05)

result = tester.proportions_test(
    conversions_control=conversions_control,
    n_control=n_control,
    conversions_treatment=conversions_treatment,
    n_treatment=n_treatment
)

print("="*60)
print("SIGNIFICANCE TEST RESULTS")
print("="*60)
print(f"\nTest: {result['test_type']}")
print(f"\nControl Rate: {result['rates']['control']:.4%}")
print(f"Treatment Rate: {result['rates']['treatment']:.4%}")
print(f"\nZ-statistic: {result['statistic']:.4f}")
print(f"P-value: {result['p_value']:.6f}")
print(f"\nSignificant at α=0.05? {result['significant']}")
print(f"\nInterpretation: {result['interpretation']}")

if result['significant']:
    print("\n✓ SIGNIFICANT RESULT")
    print("  We can reject the null hypothesis.")
    print("  There is evidence that the new design affects conversion rate.")
else:
    print("\n✗ NOT SIGNIFICANT")
    print("  We cannot reject the null hypothesis.")
    print("  The data do not provide sufficient evidence of a difference between the designs.")
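As a cross-check on the framework output, a pooled two-proportion z-test can be written in a few lines. This sketch assumes the standard pooled-variance form of the test; whether `proportions_test` uses exactly this form is not confirmed by the notebook. The function name and the counts in the example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_control, n_control, conv_treatment, n_treatment):
    """Two-sided pooled z-test for a difference in conversion rates."""
    p_control = conv_control / n_control
    p_treatment = conv_treatment / n_treatment

    # Under H0 both groups share one rate, estimated from the pooled data
    p_pool = (conv_control + conv_treatment) / (n_control + n_treatment)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treatment))

    z_stat = (p_treatment - p_control) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return z_stat, p_value


# Illustrative counts only
z_stat, p_value = two_proportion_ztest(100, 1000, 150, 1000)
print(f"z = {z_stat:.3f}, p = {p_value:.5f}")
```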

### Bootstrap Test (Non-parametric Alternative)

Let's also run a bootstrap test to verify our parametric test results.

In [0]:
# Prepare binary arrays for bootstrap
control_conversions_array = control['converted'].values
treatment_conversions_array = treatment['converted'].values

# Run bootstrap test
print("Running bootstrap test (this may take a moment)...\n")

bootstrap_result = tester.bootstrap_test(
    values_control=control_conversions_array,
    values_treatment=treatment_conversions_array,
    n_bootstrap=10000,
    statistic="proportion",
    random_seed=42
)

print("="*60)
print("BOOTSTRAP TEST RESULTS")
print("="*60)
print(f"\nTest: {bootstrap_result['test_type']}")
print(f"Bootstrap samples: {bootstrap_result['n_bootstrap']:,}")
print(f"\nObserved difference: {bootstrap_result['observed_difference']:.4f}")
print(f"P-value: {bootstrap_result['p_value']:.6f}")
print(f"\nSignificant at α=0.05? {bootstrap_result['significant']}")
print(f"\nInterpretation: {bootstrap_result['interpretation']}")

# Compare results
print("\n" + "="*60)
print("COMPARISON: Parametric vs Bootstrap")
print("="*60)
print(f"Parametric p-value: {result['p_value']:.6f}")
print(f"Bootstrap p-value:  {bootstrap_result['p_value']:.6f}")
print(f"\nBoth tests agree: {result['significant'] == bootstrap_result['significant']}")
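The notebook does not show how `bootstrap_test` resamples, so the sketch below illustrates one common variant rather than the framework's method: resampling both groups with replacement from the pooled data (which enforces the null hypothesis of no difference) and counting how often the simulated difference is at least as extreme as the observed one. The function name `pooled_bootstrap_test` and the toy arrays are hypothetical.

```python
import random


def pooled_bootstrap_test(control, treatment, n_bootstrap=2000, seed=42):
    """Two-sided bootstrap test of a difference in proportions under H0."""
    rng = random.Random(seed)
    n_c, n_t = len(control), len(treatment)
    observed = sum(treatment) / n_t - sum(control) / n_c

    # Pooling the groups simulates a world where the design has no effect
    pooled = list(control) + list(treatment)

    extreme = 0
    for _ in range(n_bootstrap):
        sim_c = rng.choices(pooled, k=n_c)
        sim_t = rng.choices(pooled, k=n_t)
        diff = sum(sim_t) / n_t - sum(sim_c) / n_c
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one correction keeps the p-value strictly above zero
    p_value = (extreme + 1) / (n_bootstrap + 1)
    return observed, p_value


# Toy data: 10% vs 30% conversion in groups of 200
control_toy = [1] * 20 + [0] * 180
treatment_toy = [1] * 60 + [0] * 140
obs, p = pooled_bootstrap_test(control_toy, treatment_toy)
print(f"Observed difference: {obs:+.3f}, bootstrap p-value: {p:.4f}")
```

On datasets with hundreds of thousands of rows a pure-Python loop like this is slow; a vectorized NumPy version would be the practical choice.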

## 5. Effect Size Estimation

Statistical significance tells us *if* there's an effect. Effect sizes tell us *how large* it is and whether it's practically meaningful.

In [0]:
# Calculate effect size with confidence intervals
calc = EffectSizeCalculator(confidence_level=0.95)

# Absolute difference
abs_effect = calc.absolute_difference_ci(
    conversions_control=conversions_control,
    n_control=n_control,
    conversions_treatment=conversions_treatment,
    n_treatment=n_treatment
)

print("="*60)
print("ABSOLUTE EFFECT SIZE")
print("="*60)
print(f"\nControl rate: {abs_effect['rates']['control']:.4%}")
print(f"Treatment rate: {abs_effect['rates']['treatment']:.4%}")
print(f"\nAbsolute difference: {abs_effect['absolute_difference']:.4%}")
print(f"95% Confidence Interval: [{abs_effect['confidence_interval'][0]:.4%}, {abs_effect['confidence_interval'][1]:.4%}]")
print(f"\nInterpretation: {abs_effect['interpretation']}")

# Check if CI includes zero
ci_lower, ci_upper = abs_effect['confidence_interval']
if ci_lower <= 0 <= ci_upper:
    print("\n⚠️ Note: Confidence interval includes zero, suggesting no clear effect.")
else:
    print("\n✓ Confidence interval does NOT include zero - clear directional effect.")

In [0]:
# Relative lift
rel_effect = calc.relative_lift_ci(
    conversions_control=conversions_control,
    n_control=n_control,
    conversions_treatment=conversions_treatment,
    n_treatment=n_treatment
)

print("="*60)
print("RELATIVE EFFECT SIZE")
print("="*60)
print(f"\nRelative lift: {rel_effect['relative_lift']:.2%}")
print(f"95% Confidence Interval: [{rel_effect['confidence_interval'][0]:.2%}, {rel_effect['confidence_interval'][1]:.2%}]")
print(f"\nInterpretation: {rel_effect['interpretation']}")
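To make the two intervals above concrete, here is a sketch of one standard way to compute them: a Wald interval for the absolute difference, and a log-scale (delta method) interval for the relative lift. These are assumptions about reasonable methods, not a description of `EffectSizeCalculator`; the helper names and example counts are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def wald_difference_ci(x_c, n_c, x_t, n_t, confidence=0.95):
    """Wald confidence interval for p_treatment - p_control."""
    p_c, p_t = x_c / n_c, x_t / n_t
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    diff = p_t - p_c
    return diff - z * se, diff + z * se


def log_ratio_lift_ci(x_c, n_c, x_t, n_t, confidence=0.95):
    """CI for relative lift (p_t / p_c - 1), built on the log scale."""
    p_c, p_t = x_c / n_c, x_t / n_t
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Delta method: Var(log p_hat) is approximately (1 - p) / (n * p)
    se_log = sqrt((1 - p_c) / x_c + (1 - p_t) / x_t)
    log_ratio = log(p_t / p_c)
    return exp(log_ratio - z * se_log) - 1, exp(log_ratio + z * se_log) - 1


# Illustrative counts only
lo, hi = wald_difference_ci(1200, 10000, 1260, 10000)
print(f"Absolute difference 95% CI: [{lo:+.4f}, {hi:+.4f}]")
lo, hi = log_ratio_lift_ci(1200, 10000, 1260, 10000)
print(f"Relative lift 95% CI: [{lo:+.2%}, {hi:+.2%}]")
```

The log-scale interval is asymmetric around the point estimate, which is expected for a ratio and avoids impossible lower bounds below -100%.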

## 6. Visualizations

In [0]:
# Create conversion rate comparison plot
viz = ResultVisualizer()

# Per-group 95% Wald intervals (normal approximation) for the error bars
z_95 = 1.96
se_control = np.sqrt(rate_control * (1 - rate_control) / n_control)
se_treatment = np.sqrt(rate_treatment * (1 - rate_treatment) / n_treatment)

fig = viz.plot_conversion_rates(
    rate_control=result['rates']['control'],
    rate_treatment=result['rates']['treatment'],
    ci_control=(rate_control - z_95 * se_control,
                rate_control + z_95 * se_control),
    ci_treatment=(rate_treatment - z_95 * se_treatment,
                  rate_treatment + z_95 * se_treatment),
    n_control=n_control,
    n_treatment=n_treatment,
    title="E-commerce Website A/B Test: Old Design vs New Design",
    save_path="ecommerce_conversion_rates.png"
)

plt.show()
print("✓ Saved: ecommerce_conversion_rates.png")

In [0]:
# Effect size visualization
fig = viz.plot_effect_size(
    effect_size=abs_effect['absolute_difference'],
    ci_lower=abs_effect['confidence_interval'][0],
    ci_upper=abs_effect['confidence_interval'][1],
    metric_name="Absolute Difference in Conversion Rate",
    title="Effect Size: New Design vs Old Design",
    save_path="ecommerce_effect_size.png"
)

plt.show()
print("✓ Saved: ecommerce_effect_size.png")

## 7. Business Impact Analysis

In [0]:
# Let's estimate the business impact
print("="*60)
print("BUSINESS IMPACT ANALYSIS")
print("="*60)

# Example business assumptions (adjust these for your actual business)
monthly_visitors = 500000
average_order_value = 50  # dollars

# Current conversions
current_conversions_monthly = monthly_visitors * result['rates']['control']
current_revenue_monthly = current_conversions_monthly * average_order_value

# Projected conversions with new design
new_conversions_monthly = monthly_visitors * result['rates']['treatment']
new_revenue_monthly = new_conversions_monthly * average_order_value

# Differences
additional_conversions = new_conversions_monthly - current_conversions_monthly
additional_revenue = new_revenue_monthly - current_revenue_monthly
annual_revenue_impact = additional_revenue * 12

print(f"\nAssumptions:")
print(f"  Monthly visitors: {monthly_visitors:,}")
print(f"  Average order value: ${average_order_value:.2f}")

print(f"\nCurrent (Old Design):")
print(f"  Monthly conversions: {current_conversions_monthly:,.0f}")
print(f"  Monthly revenue: ${current_revenue_monthly:,.2f}")

print(f"\nProjected (New Design):")
print(f"  Monthly conversions: {new_conversions_monthly:,.0f}")
print(f"  Monthly revenue: ${new_revenue_monthly:,.2f}")

print(f"\nImpact:")
print(f"  Additional conversions/month: {additional_conversions:+,.0f}")
print(f"  Additional revenue/month: ${additional_revenue:+,.2f}")
print(f"  Projected annual impact: ${annual_revenue_impact:+,.2f}")

if additional_revenue > 0:
    print(f"\n💰 Potential upside: ${annual_revenue_impact:,.2f} per year")
else:
    print(f"\n⚠️ Potential downside: ${abs(annual_revenue_impact):,.2f} per year")

In [0]:
# Conservative estimate using lower bound of confidence interval
conservative_rate = result['rates']['control'] + abs_effect['confidence_interval'][0]
conservative_conversions = monthly_visitors * conservative_rate
conservative_revenue = conservative_conversions * average_order_value - current_revenue_monthly
conservative_annual = conservative_revenue * 12

print("="*60)
print("CONSERVATIVE ESTIMATE (95% CI Lower Bound)")
print("="*60)
print(f"\nConservative conversion rate: {conservative_rate:.4%}")
print(f"Conservative monthly impact: ${conservative_revenue:+,.2f}")
print(f"Conservative annual impact: ${conservative_annual:+,.2f}")

if conservative_annual > 0:
    print(f"\n✓ Even in the worst case (95% CI lower bound), we expect ${conservative_annual:,.2f}/year")
else:
    print(f"\n⚠️ In the worst case, we could lose ${abs(conservative_annual):,.2f}/year")

## 8. Final Recommendation

In [0]:
print("="*60)
print("FINAL RECOMMENDATION")
print("="*60)

print(f"\n📊 TEST RESULTS SUMMARY:")
print(f"  • Sample size: {n_control + n_treatment:,} users ({n_control:,} control, {n_treatment:,} treatment)")
print(f"  • Old design conversion: {result['rates']['control']:.2%}")
print(f"  • New design conversion: {result['rates']['treatment']:.2%}")
print(f"  • Absolute difference: {abs_effect['absolute_difference']:.2%} (95% CI: [{abs_effect['confidence_interval'][0]:.2%}, {abs_effect['confidence_interval'][1]:.2%}])")
print(f"  • Relative lift: {rel_effect['relative_lift']:.1%}")
print(f"  • P-value: {result['p_value']:.6f}")
print(f"  • Statistical power: {power_result['power']:.1%}")

print(f"\n🎯 DECISION CRITERIA:")
criteria = [
    ("Statistically significant (p < 0.05)", result['significant']),
    ("Sufficient power (≥ 80%)", power_result['power'] >= 0.8),
    ("Positive effect", abs_effect['absolute_difference'] > 0),
    ("CI excludes zero", ci_lower > 0 or ci_upper < 0)
]

for criterion, met in criteria:
    status = "✓" if met else "✗"
    print(f"  {status} {criterion}")

# Make recommendation
meets_all_criteria = all(met for _, met in criteria)

print(f"\n" + "="*60)
if meets_all_criteria and abs_effect['absolute_difference'] > 0:
    print("✅ RECOMMENDATION: IMPLEMENT THE NEW DESIGN")
    print("="*60)
    print(f"\nReason: The new design shows a statistically significant improvement")
    print(f"in conversion rate with {power_result['power']:.0%} statistical power.")
    print(f"\nExpected impact: ${annual_revenue_impact:+,.2f} annually")
    print(f"Conservative estimate: ${conservative_annual:+,.2f} annually")

elif result['significant'] and abs_effect['absolute_difference'] < 0:
    print("❌ RECOMMENDATION: KEEP THE OLD DESIGN")
    print("="*60)
    print(f"\nReason: The new design actually DECREASES conversion rate.")
    print(f"This difference is statistically significant.")
    print(f"\nExpected impact: ${abs(annual_revenue_impact):,.2f} loss annually")

elif not result['significant'] and power_result['power'] >= 0.8:
    print("↔️ RECOMMENDATION: KEEP THE OLD DESIGN (No Clear Winner)")
    print("="*60)
    print(f"\nReason: No statistically significant difference detected.")
    print(f"The test had sufficient power ({power_result['power']:.0%}), so a")
    print(f"meaningful effect is unlikely to have been missed.")
    print(f"\nStick with the old design to avoid implementation costs.")

elif not result['significant'] and power_result['power'] < 0.8:
    print("⚠️ RECOMMENDATION: RUN A LARGER TEST")
    print("="*60)
    print(f"\nReason: No significant difference detected, but the test")
    print(f"only had {power_result['power']:.0%} power - below the 80% standard.")
    print(f"\nThis is inconclusive. We may have missed a real effect due to")
    print(f"insufficient sample size.")
    print(f"\nRecommend: Collect more data or run a longer test.")
    print(f"Need {sample_size['total_sample_size']:,} total users for 80% power to detect a 2-point absolute lift.")

else:
    print("❓ RECOMMENDATION: FURTHER INVESTIGATION NEEDED")
    print("="*60)
    print(f"\nReason: Results are ambiguous. Review the data quality,")
    print(f"consider segmentation analysis, or consult with stakeholders.")

print("\n" + "="*60)

## 9. Additional Analyses (Optional)

### Segmentation Analysis

If your dataset includes additional features (device type, country, etc.), you can analyze different segments.

In [0]:
# Check what columns are available for segmentation
print("Available columns for segmentation:")
print(df_clean.columns.tolist())

# If timestamp is available, we can analyze by time period
if 'timestamp' in df_clean.columns:
    df_clean['timestamp'] = pd.to_datetime(df_clean['timestamp'])
    df_clean['date'] = df_clean['timestamp'].dt.date
    df_clean['hour'] = df_clean['timestamp'].dt.hour
    
    # Conversion by day
    daily_conv = df_clean.groupby(['date', 'group'])['converted'].mean().unstack()
    
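    # Pass an explicit Axes so pandas draws on this figure instead of opening a new one
    fig, ax = plt.subplots(figsize=(14, 6))
    daily_conv.plot(kind='line', marker='o', ax=ax)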
    plt.title('Daily Conversion Rates Over Time', fontsize=14, fontweight='bold')
    plt.ylabel('Conversion Rate')
    plt.xlabel('Date')
    plt.legend(['Control', 'Treatment'])
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()
else:
    print("\nNo timestamp data available for time-based analysis.")
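When several segments are tested separately, the chance of at least one false positive grows with the number of tests. The notebook imports `MultipleTestingCorrection` but never calls it, and its API is not shown, so the sketch below implements the Holm step-down adjustment directly as one possible approach. The function name `holm_adjust` and the segment p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])

    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical per-segment p-values (e.g. one z-test per segment)
segment_p_values = {"segment_a": 0.01, "segment_b": 0.04, "segment_c": 0.03}
adjusted = holm_adjust(list(segment_p_values.values()))
for (segment, raw), adj in zip(segment_p_values.items(), adjusted):
    print(f"{segment}: raw p = {raw:.3f}, Holm-adjusted p = {adj:.3f}")
```

A segment is then declared significant only if its adjusted p-value is below the chosen alpha.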

## Summary

This notebook demonstrated a complete A/B test analysis workflow:

1. ✅ **Data loading and cleaning** - Removed mismatches and duplicates
2. ✅ **Exploratory analysis** - Understood the data structure and distributions
3. ✅ **Power analysis** - Evaluated whether the test had sufficient power
4. ✅ **Significance testing** - Both parametric and bootstrap methods
5. ✅ **Effect size estimation** - Measured practical significance with confidence intervals
6. ✅ **Visualization** - Created publication-ready plots
7. ✅ **Business impact** - Translated statistical results into business terms
8. ✅ **Clear recommendation** - Made a data-driven decision

### Key Takeaways:
- Always check data quality before analysis
- Use both statistical significance AND practical significance
- Consider confidence intervals, not just point estimates
- Verify results with multiple methods (parametric + bootstrap)
- Translate findings into business impact