# Understanding A/B Testing
A/B testing (also called split testing) is an experiment where you compare two versions of something (e.g., a website, an email, an ad) to determine which one performs better. The goal is to make data-driven decisions.


## 1. Key Components of A/B Testing
### A. Control Group vs. Treatment Group
- **Control Group (A):** This group gets the current version.
- **Treatment Group (B):** This group gets the new version.
- Users are randomly assigned to either group and performance metrics are collected.

### B. Key Metrics to Measure
- **Conversion Rate (CR):** Percentage of users who complete a desired action.
- **Click-Through Rate (CTR):** Percentage of users who click a link or button.
- **Revenue per Visitor (RPV):** Revenue generated per visitor.

**Formula for Conversion Rate (CR):**
```r
CR <- function(conversions, visitors) {
  return (conversions / visitors)
}
CR(120, 1000)  # Example: 120/1000 = 12%
```

## 2. Important Considerations for a Good A/B Test
### A. Randomization
- Users must be randomly assigned to either group to prevent bias.

### B. Sample Size & Statistical Power
- Ensure a large enough sample size to detect meaningful differences.
- Common significance level (α) and power:
  - α = 0.05 (5% risk of false positives)
  - Power = 0.8 (80% chance of detecting real differences)
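
To make the α and power settings concrete, the sketch below approximates the power of a two-sided two-proportion z-test under a normal approximation. The function name `approx_power` and the example numbers (10% baseline, a 2-point absolute lift, 3,532 users per group) are illustrative assumptions, not values from a real experiment.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p0, delta, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    Simplifying assumptions: both groups have variance p0 * (1 - p0), and the
    tiny chance of rejecting in the wrong direction is ignored.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    standard_error = sqrt(2 * p0 * (1 - p0) / n_per_group)
    return std_normal.cdf(abs(delta) / standard_error - z_alpha)


power = approx_power(p0=0.10, delta=0.02, n_per_group=3532)
print(f"Approximate power: {power:.3f}")
```

Under these assumptions, shrinking the sample or the effect size lowers the power, which is why both are fixed before the test starts.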

### C. Statistical Significance (p-value)
- **Null Hypothesis (H₀):** "No difference between A and B."
- **Alternative Hypothesis (H₁):** "There is a difference."
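- A p-value below the chosen α (commonly 0.05) leads us to reject H₀. A larger p-value means we fail to reject H₀; it does not prove that A and B are equal.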
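
As a rough illustration of how a p-value is obtained for conversion data, here is a minimal sketch of a pooled two-proportion z-test using only the standard library. The function name and the counts (120 and 160 conversions out of 1,000 users each) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both conversion rates are equal."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # shared rate assumed under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


alpha = 0.05
z, p_value = two_proportion_z_test(conv_a=120, n_a=1000, conv_b=160, n_b=1000)
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"z = {z:.3f}, p = {p_value:.4f} -> {decision}")
```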

### D. Type I & Type II Errors
- **Type I Error (False Positive):** Wrongly detecting a difference.
- **Type II Error (False Negative):** Failing to detect a real difference.

## 3. Statistical Methods Used in A/B Testing
### A. Two-Sample t-Test
**Definition:** The two-sample t-test (also called the independent t-test) compares the means of two groups to determine whether the difference between them is statistically significant. In A/B testing, it is used for numerical metrics such as average time spent on a page or revenue per user.

**Assumptions**
- Independence – The samples must be independent.
- Normality – The data should be normally distributed (if sample size is large, the Central Limit Theorem applies).
- Equal Variances (for standard t-test) – If variances are unequal, we use Welch’s t-test.
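
When the equal-variance assumption is doubtful, Welch's version changes both the standard error and the degrees of freedom. The sketch below computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom by hand; the order values are made up for illustration, and the p-value lookup is left to a statistics library.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # variance() is the sample variance (divides by n - 1)
    sq_se_a = variance(sample_a) / n_a
    sq_se_b = variance(sample_b) / n_b
    t_stat = (mean(sample_b) - mean(sample_a)) / sqrt(sq_se_a + sq_se_b)
    dof = (sq_se_a + sq_se_b) ** 2 / (
        sq_se_a ** 2 / (n_a - 1) + sq_se_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Hypothetical order values per user for each variant
orders_a = [20.0, 22.5, 19.0, 24.0, 21.5, 23.0]
orders_b = [25.0, 31.0, 22.0, 35.5, 27.0, 29.5]
t_stat, dof = welch_t(orders_a, orders_b)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.2f}")
```

With unequal variances the degrees of freedom fall below `n_a + n_b - 2`, which makes the test slightly more conservative than the pooled version.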
    
**Pros:**
- Simple and widely used.
- Works well with normally distributed data.

**Cons:**
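- The standard (Student's) version assumes equal variances in both groups; Welch's version relaxes this.
- With small samples, results are sensitive to skewed or otherwise non-normal data (e.g., revenue with a few large orders).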

**Best Scenario to Use:**
- When comparing continuous numerical data, such as average revenue per user or time spent on a webpage.

```r
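# Simulated A/B test data (one 0/1 outcome per user)
set.seed(42)  # make the simulation reproducible
a <- rbinom(1000, 1, 0.13)  # 13% conversion rate
b <- rbinom(1000, 1, 0.15)  # 15% conversion rate

# Perform the t-test; t.test() defaults to Welch's version (var.equal = FALSE).
# On 0/1 outcomes this approximates a test of proportions; for conversion data
# the chi-square test in the next section is the more direct choice.
t_test <- t.test(a, b)
print(t_test)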
```

**When to Use a t-Test?**
- Comparing average spending per user.
- Comparing time spent on a webpage between two versions.
- Comparing average order value.
    
### B. Chi-Square Test
**Definition:** The chi-square test checks whether there is a significant association between categorical variables. In A/B testing, it is commonly used to compare proportions between two groups, such as conversion rates (e.g., how many users clicked a button or made a purchase).

**Assumptions**
- Data is categorical (e.g., converted vs. not converted).
- Expected frequency in each cell is at least 5 for reliable results.                                                                                                                                          
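
To show where the expected frequencies come from, here is a minimal sketch of Pearson's chi-square test for a 2x2 table, written with the standard library only. It uses the same counts as the R example in this section but applies no continuity correction, so its statistic will differ slightly from R's default `chisq.test()` output for 2x2 tables. The function name is an assumption for illustration.

```python
from math import erfc, sqrt


def chi_square_2x2(observed):
    """Pearson chi-square test for a 2x2 table (no continuity correction)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value, expected


observed = [[200, 800], [250, 750]]  # rows: A, B; columns: converted, not converted
chi2, p_value, expected = chi_square_2x2(observed)
assumption_ok = all(cell >= 5 for row in expected for cell in row)
print(f"chi2 = {chi2:.4f}, p = {p_value:.4f}, expected counts >= 5: {assumption_ok}")
```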
                                                                                                                                          
**Pros:**
- Works well for categorical data (e.g., conversion rates, user actions).
- Does not assume a normal distribution.

**Cons:**
- Requires sufficiently large sample sizes.
- Not ideal for very small data samples due to unstable results.

**Best Scenario to Use:**
- When comparing conversion rates between groups (e.g., click-through rates for A vs. B).

```r
# Simulated conversion data
observed <- matrix(c(200, 800, 250, 750), nrow=2, byrow=TRUE)
colnames(observed) <- c("Converted", "Not Converted")
rownames(observed) <- c("A", "B")

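# Perform chi-square test
# Note: for 2x2 tables chisq.test() applies Yates' continuity correction by
# default; pass correct = FALSE for the uncorrected Pearson statistic.
chi_test <- chisq.test(observed)
print(chi_test)
```

**When to Use a Chi-Square Test?**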
- Comparing click-through rates.
- Comparing signup rates between two versions.
- Checking if conversion rates differ between A and B.

### C. Bayesian A/B Testing
**Definition:** Bayesian A/B testing uses probability distributions (Beta distributions) to model conversion rates and determine the probability of one variation being better than another.
- Instead of a binary "reject/fail to reject" decision like t-tests, Bayesian A/B testing provides the probability that one variant is better than the other.

**How Does it Work?**
- We model the conversion rate as a Beta distribution.
- We update beliefs with observed data using Bayes' Theorem.
- We compute the probability that B is better than A.
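
The three steps above can be sketched without any Bayesian library, because a Beta prior combined with binomial data yields a Beta posterior. The snippet below assumes a uniform Beta(1, 1) prior and estimates P(B > A) by Monte Carlo sampling; the function name, draw count, and seed are illustrative choices.

```python
import random


def prob_b_beats_a(success_a, trials_a, success_b, trials_b,
                   prior_alpha=1.0, prior_beta=1.0, draws=20000, seed=0):
    """Estimate P(rate_B > rate_A) from Beta posteriors via Monte Carlo."""
    rng = random.Random(seed)
    # Conjugate update: Beta(prior_alpha + successes, prior_beta + failures)
    post_a = (prior_alpha + success_a, prior_beta + trials_a - success_a)
    post_b = (prior_alpha + success_b, prior_beta + trials_b - success_b)
    wins = sum(
        rng.betavariate(*post_b) > rng.betavariate(*post_a)
        for _ in range(draws)
    )
    return wins / draws


p_b_better = prob_b_beats_a(200, 1000, 250, 1000)
print(f"Estimated P(B > A): {p_b_better:.3f}")
```

A stronger prior (larger `prior_alpha` and `prior_beta`) pulls both posteriors toward the prior mean, which matters most when the observed counts are small.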
    
**Pros:**
- Provides the probability that one variant is better than the other rather than a binary yes/no decision.
- More intuitive for decision-making.

**Cons:**
- More computationally intensive.
- Requires careful choice of priors.

**Best Scenario to Use:**
- When decision-makers prefer a probability-based approach instead of relying on p-values.
- When data is limited, and prior knowledge can help improve estimates.

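```r
# Install once if needed: install.packages("bayesAB")
library(bayesAB)

# Simulated Bayesian A/B Test
a_success <- 200
a_trials <- 1000
b_success <- 250
b_trials <- 1000

# bayesTest() expects raw 0/1 outcome vectors, so expand the counts
a_data <- c(rep(1, a_success), rep(0, a_trials - a_success))
b_data <- c(rep(1, b_success), rep(0, b_trials - b_success))

bayes_result <- bayesTest(
  a_data, b_data,
  priors = c("alpha" = 1, "beta" = 1),  # uniform Beta(1, 1) prior
  n_samples = 50000,
  distribution = "bernoulli"
)

print(summary(bayes_result))
plot(bayes_result)
```

**When to Use Bayesian A/B Testing?**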
- When you need a probability estimate instead of a binary decision.
- When sample sizes are small.
- When decision-makers prefer a risk-based approach.
                      
## 4. Conclusion
- **t-tests** and **chi-square tests** are commonly used.
- **Bayesian methods** provide probability-based insights.
- **Randomization** ensures fairness.
- **Adequate sample size** prevents misleading results.
- **Stopping tests too early** can lead to incorrect conclusions.

This R Markdown file documents A/B testing principles, methods, and practical implementations. 🚀


# A/B Testing Cheat Sheet

## 1. Selecting Metrics for Experimentation
Choosing the right metrics ensures meaningful results. Consider:
- **Primary Metric**: The main success indicator (e.g., conversion rate, revenue per user).
- **Secondary Metrics**: Additional insights (e.g., bounce rate, average session duration).
- **Leading Indicators**: Early signals that correlate with long-term impact.
- **Guardrail Metrics**: Ensure the experiment does not negatively impact other areas (e.g., page load time, retention rate).

**Best Practices:**
- Align metrics with business goals.
- Use metrics that are directly impacted by the test.
- Avoid vanity metrics (e.g., total page views if engagement matters more).

---

## 2. Selecting Randomization Units
Randomization ensures fairness in group assignment. Common units include:
- **User-Level Randomization** (e.g., individual visitors, accounts): Ideal for personalized experiences.
- **Session-Level Randomization**: Good for short-lived experiments but may introduce bias from repeat users.
- **Device-Level Randomization**: Used when the user experience differs across devices.
- **Geographic/Regional Randomization**: When running country-specific experiments.
- **Cookie-Based Randomization**: Keeps treatment consistent per browser for as long as the cookie persists; users who clear cookies or switch browsers may be reassigned.

**Best Practices:**
- Ensure each unit has an equal chance of being assigned.
- Minimize cross-group contamination.
- Consider user behavior (e.g., multi-device users may need user-level randomization).
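
One common way to get stable user-level assignment is to hash the user ID together with an experiment name, so the same user always lands in the same group on every device where they are logged in. The sketch below is a minimal illustration; the experiment name `homepage_cta_test`, the ID format, and the 50/50 split are assumptions.

```python
import hashlib


def assign_variant(user_id, experiment="homepage_cta_test", treatment_share=0.5):
    """Deterministically map a user to 'A' or 'B' for a given experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the first 32 bits of the hash as a number in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < treatment_share else "A"


assignments = [assign_variant(f"user-{i}") for i in range(10000)]
share_b = assignments.count("B") / len(assignments)
print(f"Share assigned to B: {share_b:.3f}")
```

Including the experiment name in the hash keeps assignments independent across experiments, which helps limit cross-test contamination.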

---

## 3. Choosing a Target Population
Defining your target audience ensures valid, actionable results.
- **General Population:** Testing on all users to understand broad impact.
- **New Users:** Measuring the effect on first-time visitors.
- **Returning Users:** Understanding impact on loyal customers.
- **High-Intent Users:** Users already close to conversion.
- **Geographic Segments:** Comparing behavior across different regions.

**Best Practices:**
- Ensure the sample is representative of your actual audience.
- Exclude anomalies (e.g., internal users, bot traffic).
- Consider external factors (e.g., seasonality, market trends).

---

## 4. Computing Sample Size
A proper sample size ensures statistical reliability.

**Key Inputs:**
- **Baseline Conversion Rate (p₀)**: The current conversion rate.
- **Minimum Detectable Effect (MDE)**: The smallest effect worth detecting.
- **Statistical Power (1-β)**: Usually set at 80%.
- **Significance Level (α)**: Typically set at 5% (p < 0.05).

**Formula for Sample Size Per Group:**
```r
n <- function(p0, MDE, alpha=0.05, power=0.8) {
  z_alpha <- qnorm(1 - alpha / 2)
  z_beta <- qnorm(power)
  return ((2 * p0 * (1 - p0) * (z_alpha + z_beta)^2) / MDE^2)
}
n(0.1, 0.02) # Example calculation
```

**Best Practices:**
- Use online calculators (e.g., Optimizely, Evan Miller's calculator).
- Ensure the test has enough power to detect real changes.
- If results are inconclusive, collect more data rather than shrinking the MDE after the fact; a smaller MDE requires an even larger sample.
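
Because the required sample grows with 1/MDE², small target effects become expensive quickly. The sketch below mirrors the R formula above in Python (treating MDE as an absolute difference in conversion rate, rounded up to whole users) and prints the requirement for a few hypothetical MDE values.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p0, mde, alpha=0.05, power=0.8):
    """Users needed per group; mde is an absolute difference in rates."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * p0 * (1 - p0) * (z_alpha + z_beta) ** 2 / mde ** 2)


for mde in (0.04, 0.02, 0.01):
    n_required = sample_size_per_group(p0=0.10, mde=mde)
    print(f"MDE = {mde:.2f} -> {n_required} users per group")
```

Halving the MDE roughly quadruples the required sample, so the MDE should reflect the smallest change that is actually worth acting on.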

---

## 5. Determining Test Duration
The test duration depends on sample size, traffic, and expected conversion rates.

**Steps to Determine Duration:**
1. **Estimate daily visitors** (per variant).
2. **Compute required sample size** (from previous step).
3. **Divide sample size by daily traffic** to get the test duration.
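
The steps above amount to a single division, optionally rounded up to whole weeks so that every weekday is covered equally. The sketch below is one way to write it; the function name, the weekly rounding, and the traffic figure of 400 visitors per variant per day are assumptions for illustration.

```python
from math import ceil


def estimate_duration_days(sample_size_per_variant, daily_visitors_per_variant,
                           round_to_full_weeks=True):
    """Days needed to reach the required sample size in each variant."""
    days = ceil(sample_size_per_variant / daily_visitors_per_variant)
    if round_to_full_weeks:
        days = ceil(days / 7) * 7  # cover complete weekly cycles
    return days


raw_days = estimate_duration_days(3532, 400, round_to_full_weeks=False)
planned_days = estimate_duration_days(3532, 400)
print(f"Minimum: {raw_days} days, planned: {planned_days} days")
```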

**Best Practices:**
- Run tests for at least **1-2 full business cycles** (to capture weekly patterns).
- Avoid stopping tests too early based on interim results.
- Use a sequential analysis approach for early stopping when necessary.
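
The risk of stopping on interim results can be illustrated with a small A/A simulation, where both groups share the same true conversion rate and any "significant" result is a false positive. The sketch below compares checking only at the end with declaring a winner at the first significant peek. All parameters (300 simulations, 10 peeks, 100 users per group per peek, a 10% rate) are arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(conv_a, conv_b, n, z_crit):
    """Pooled two-proportion z-test with n users in each group."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    return abs(conv_b - conv_a) / n / se > z_crit


def simulate_aa_tests(n_sims=300, peeks=10, users_per_peek=100,
                      rate=0.10, alpha=0.05, seed=1):
    """Return (false positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = final_hits = 0
    for _sim in range(n_sims):
        conv_a = conv_b = n = 0
        any_significant = last_significant = False
        for _peek in range(peeks):
            conv_a += sum(rng.random() < rate for _ in range(users_per_peek))
            conv_b += sum(rng.random() < rate for _ in range(users_per_peek))
            n += users_per_peek
            last_significant = is_significant(conv_a, conv_b, n, z_crit)
            any_significant = any_significant or last_significant
        peeking_hits += any_significant
        final_hits += last_significant
    return peeking_hits / n_sims, final_hits / n_sims


peeking_rate, final_rate = simulate_aa_tests()
print(f"False positives with peeking: {peeking_rate:.1%}, final look only: {final_rate:.1%}")
```

In runs like this the peeking rate typically lands well above the nominal 5%, which is the inflation that sequential testing methods are designed to control.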

---

## 6. Analyzing Results
Once data is collected, statistical analysis determines if the observed differences are significant.

**Common Tests:**
- **Two-Sample t-Test**: Compares means (e.g., revenue per user).
- **Chi-Square Test**: Compares categorical data (e.g., conversion rates).
- **Bayesian Analysis**: Provides probability estimates rather than binary outcomes.

**Key Concepts:**
- **P-Value < 0.05**: Suggests statistical significance.
- **Confidence Interval (CI)**: Helps understand the range of true effects.
- **Lift Calculation**: Measures the relative improvement of B over A, in percent.

```r
lift <- function(rate_A, rate_B) {
  return ((rate_B - rate_A) / rate_A * 100)
}
lift(0.1, 0.12) # Example: 10% -> 12% is a 20% relative lift
```
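
A confidence interval for the difference in conversion rates complements the p-value and the lift. The sketch below computes a simple Wald interval under a normal approximation; the counts (100 and 120 conversions out of 1,000 users each) are hypothetical and match the 10% vs. 12% rates used in the lift example.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval for rate_B - rate_A."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    se = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = rate_b - rate_a
    return diff - z * se, diff + z * se


low, high = diff_confidence_interval(100, 1000, 120, 1000)
print(f"95% CI for the absolute difference: [{low:.4f}, {high:.4f}]")
```

An interval that includes zero, as it does for these counts, indicates that the observed 20% relative lift is not yet distinguishable from no effect at this sample size.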

**Best Practices:**
- Ensure results remain stable before making decisions.
- Segment results (e.g., by device, region) to uncover hidden trends.
- Validate with a post-test analysis (e.g., long-term impact, retention).

---

## 7. Alternatives to A/B Testing
If A/B testing is not feasible, consider alternative methods:
- **Multi-Armed Bandit (MAB)**: Adapts allocation dynamically to maximize performance.
- **Sequential Testing**: Allows early stopping if results are conclusive.
- **Synthetic Control Method**: Used when randomization is impractical.
- **Difference-in-Differences (DID)**: Compares trends over time between test and control groups.
- **Pre-Post Analysis**: Compares metrics before and after changes (used when A/B testing isn't possible).

**Best Practices:**
- Use MAB when time-sensitive optimizations are needed.
- Use DID for observational data where A/B testing is infeasible.
- Pre-Post works for **major redesigns** but is less reliable than randomized experiments.
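
The difference between a plain pre-post comparison and DID comes down to subtracting the control group's trend. The sketch below uses made-up conversion rates and assumes the two groups would have followed parallel trends without the change, which is the key assumption behind DID.

```python
def pre_post_effect(treated_pre, treated_post):
    """Naive estimate: change in the treated group only."""
    return treated_post - treated_pre


def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus the change in the control group."""
    return (treated_post - treated_pre) - (control_post - control_pre)


# Hypothetical conversion rates before and after a launch
naive = pre_post_effect(treated_pre=0.10, treated_post=0.13)
did = difference_in_differences(
    treated_pre=0.10, treated_post=0.13,
    control_pre=0.10, control_post=0.11,
)
print(f"Pre-post estimate: {naive:.3f}, DID estimate: {did:.3f}")
```

Here the pre-post estimate attributes the whole 3-point change to the launch, while DID removes the 1-point rise that the control group also experienced.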

---

## Final Thoughts
A/B testing is a powerful tool when done correctly. Following structured methodologies ensures reliable, data-driven decisions. Always validate results, and consider alternatives when necessary.

---
# A/B Testing Interview Cheat Sheet

## 1. Fundamentals of A/B Testing
**Q: What is A/B testing?**
- A/B testing (split testing) is a controlled experiment comparing two versions (A and B) to determine which performs better based on predefined metrics.

**Q: Why is A/B testing important?**
- Helps in data-driven decision-making.
- Reduces guesswork in product and marketing changes.
- Measures the impact of changes before rolling them out widely.

**Q: What are some common use cases?**
- Optimizing website design (CTA buttons, landing pages).
- Testing different email subject lines for engagement.
- Comparing product pricing strategies.
- Improving user experience in apps.

---

## 2. Designing an A/B Test
**Q: How do you select metrics for an A/B test?**
- **Primary Metric**: The main success indicator (e.g., conversion rate, revenue per user).
- **Secondary Metrics**: Additional insights (e.g., session duration, bounce rate).
- **Guardrail Metrics**: Ensure no unintended negative impact (e.g., page load speed).

**Q: How do you select randomization units?**
- **User-level**: Most common, ensures users experience only one variant.
- **Session-level**: Used when the test is short-lived.
- **Device-level**: When experiments need to be device-consistent.
- **Geographic-level**: Used for regional campaigns.

**Q: What factors influence the target population?**
- New vs. returning users.
- High-intent vs. casual visitors.
- Geographic or demographic segments.

---

## 3. Statistical Concepts in A/B Testing
**Q: What is statistical significance?**
- Indicates that a difference as large as the one observed would be unlikely if there were truly no difference between A and B.
- A **p-value < 0.05** typically means results are statistically significant.

**Q: What is the difference between Type I and Type II errors?**
- **Type I Error (False Positive):** Incorrectly rejecting the null hypothesis.
- **Type II Error (False Negative):** Failing to detect a real difference.

**Q: How do you calculate sample size for an A/B test?**
- Based on **baseline conversion rate**, **minimum detectable effect (MDE)**, **statistical power (80%)**, and **significance level (5%)**.
- Formula:
  ```r
  n <- function(p0, MDE, alpha=0.05, power=0.8) {
    z_alpha <- qnorm(1 - alpha / 2)
    z_beta <- qnorm(power)
    return ((2 * p0 * (1 - p0) * (z_alpha + z_beta)^2) / MDE^2)
  }
  n(0.1, 0.02) # Example calculation
  ```

**Q: What is the lift in A/B testing?**
- **Lift** measures the percentage increase in the metric between A and B.
  ```r
  lift <- function(rate_A, rate_B) {
    return ((rate_B - rate_A) / rate_A * 100)
  }
  lift(0.1, 0.12) # Example calculation
  ```

---

## 4. Running and Analyzing an A/B Test
**Q: How long should an A/B test run?**
- Run until the **precomputed sample size is reached** and at least **one full business cycle** is completed; do not stop as soon as the result first looks significant.
- Consider seasonality and traffic fluctuations.

**Q: How do you analyze A/B test results?**
- **Two-Sample t-Test**: Compares means (e.g., revenue per user).
- **Chi-Square Test**: Compares categorical data (e.g., conversion rates).
- **Bayesian Analysis**: Provides probability estimates instead of p-values.

**Q: What do you do if your A/B test results are inconclusive?**
- Increase sample size.
- Test a bolder change that is expected to produce a larger effect.
- Analyze different segments.
- Consider external factors (e.g., seasonality, traffic sources).

---

## 5. Challenges and Pitfalls
**Q: What are some common mistakes in A/B testing?**
- **Stopping the test too early**: Leads to misleading conclusions.
- **Multiple testing problem**: Running many tests increases false positives.
- **Ignoring interaction effects**: Some tests impact others (e.g., pricing vs. UI changes).
- **Poor randomization**: Biased test groups can invalidate results.
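
The multiple testing problem can be quantified with a short calculation. The sketch below assumes the tests are independent and shows how the chance of at least one false positive grows with the number of tests; it also shows the Bonferroni correction, one standard (and conservative) remedy that is not specific to this cheat sheet.

```python
def family_wise_error_rate(alpha, n_tests):
    """P(at least one false positive) across independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level that keeps the family-wise rate at most alpha."""
    return alpha / n_tests


for n_tests in (1, 5, 20):
    fwer = family_wise_error_rate(0.05, n_tests)
    adjusted = bonferroni_alpha(0.05, n_tests)
    print(f"{n_tests:>2} tests: FWER = {fwer:.3f}, Bonferroni alpha = {adjusted:.4f}")
```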

**Q: How do you handle low sample sizes?**
- Consider **Bayesian methods**, which can incorporate prior knowledge when data is limited.
- Run tests on **high-traffic pages** for faster results.
- Consider **Multi-Armed Bandits** to dynamically allocate traffic.

---

## 6. Alternatives to A/B Testing
**Q: What can you do if an A/B test isn’t feasible?**
- **Multi-Armed Bandit (MAB):** Dynamically allocates traffic to better-performing variants.
- **Sequential Testing:** Allows stopping tests early based on statistical thresholds.
- **Difference-in-Differences (DID):** Compares trends over time instead of random assignment.
- **Synthetic Control Method:** Used when true randomization is not possible.
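
To illustrate how a multi-armed bandit shifts traffic toward the better variant, here is a minimal Thompson sampling sketch with Beta(1, 1) priors. Thompson sampling is only one of several bandit strategies, and the true conversion rates, visitor count, and seed below are simulation assumptions.

```python
import random


def thompson_sampling(true_rates, n_visitors=5000, seed=0):
    """Simulate Thompson sampling; return how many visitors each arm received."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(n_visitors):
        # Draw a plausible conversion rate for each arm from its Beta posterior
        samples = [
            rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)
        ]
        arm = samples.index(max(samples))  # show the arm with the best draw
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


pulls = thompson_sampling(true_rates=[0.05, 0.15])
print(f"Visitors sent to A: {pulls[0]}, to B: {pulls[1]}")
```

Unlike a fixed 50/50 split, the allocation adapts as evidence accumulates, which reduces the cost of showing the weaker variant but makes classical significance testing harder to apply.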

---

## 7. Final Tips for Acing A/B Testing Interviews
- Be clear on **business objectives** before discussing test setup.
- Explain **why a test is needed** and **how success is measured**.
- Show an understanding of **both statistical and practical significance**.
- Be ready to discuss **edge cases** (e.g., how to handle bot traffic, seasonality, or experiment conflicts).
- Mention **real-world challenges** and how you’d address them.


```python
# Simple example using the two-sample t-test

# Import necessary libraries
import numpy as np
from scipy import stats

# Simulate conversion data
np.random.seed(42)
conversions_A = np.random.binomial(1, 0.13, 1000)  # 13% conversion rate
conversions_B = np.random.binomial(1, 0.15, 1000)  # 15% conversion rate

# Calculate conversion rates
conversion_rate_A = conversions_A.mean()
conversion_rate_B = conversions_B.mean()
print(f'Conversion Rate A: {conversion_rate_A:.2%}')
print(f'Conversion Rate B: {conversion_rate_B:.2%}')

# Perform a two-sample t-test
t_stat, p_value = stats.ttest_ind(conversions_A, conversions_B)
print(f'T-statistic: {t_stat:.4f}')
print(f'P-value: {p_value:.4f}')

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print('Reject the null hypothesis: There is a significant difference between the two versions.')
else:
    print('Fail to reject the null hypothesis: No significant difference between the two versions.')
```

Output:

```
Conversion Rate A: 13.40%
Conversion Rate B: 15.20%
T-statistic: -1.1495
P-value: 0.2505
Fail to reject the null hypothesis: No significant difference between the two versions.
```


```python
import numpy as np
import pandas as pd
import scipy.stats as stats
import pymc3 as pm
import matplotlib.pyplot as plt
import requests
from io import StringIO

# Load real-world external dataset (example: marketing campaign data)
url = "https://raw.githubusercontent.com/anshooarora/AB_Testing_Dataset/master/ab_data.csv"
response = requests.get(url)
response.raise_for_status()  # fail early if the download did not succeed
data = pd.read_csv(StringIO(response.text))

# Preprocess the data (.copy() avoids pandas' SettingWithCopyWarning)
data = data[data['group'].isin(['control', 'treatment'])].copy()
data['converted'] = data['converted'].astype(int)

# Split into groups
control = data[data['group'] == 'control']['converted']
treatment = data[data['group'] == 'treatment']['converted']

# Perform Two-Sample t-Test
def perform_t_test(control, treatment):
    t_stat, p_value = stats.ttest_ind(control, treatment)
    print("\nTwo-Sample t-Test Results:")
    print(f"T-Statistic: {t_stat:.4f}, P-Value: {p_value:.4f}")
    if p_value < 0.05:
        print("Statistically significant difference detected!")
    else:
        print("No statistically significant difference detected.")

perform_t_test(control, treatment)

# Perform Chi-Square Test
def perform_chi_square_test(control, treatment):
    # Rows: control, treatment; columns: converted, not converted
    obs = np.array([[control.sum(), len(control) - control.sum()],
                    [treatment.sum(), len(treatment) - treatment.sum()]])
    # For 2x2 tables chi2_contingency applies Yates' correction by default
    chi2, p, dof, expected = stats.chi2_contingency(obs)
    print("\nChi-Square Test Results:")
    print(f"Chi-Square Statistic: {chi2:.4f}, P-Value: {p:.4f}")
    if p < 0.05:
        print("Statistically significant difference detected!")
    else:
        print("No statistically significant difference detected.")

perform_chi_square_test(control, treatment)

# Perform Bayesian A/B Testing
def perform_bayesian_ab_test(success_A, trials_A, success_B, trials_B):
    with pm.Model():
        # Uniform Beta(1, 1) priors on both conversion rates
        p_A = pm.Beta("p_A", alpha=1, beta=1)
        p_B = pm.Beta("p_B", alpha=1, beta=1)

        obs_A = pm.Binomial("obs_A", n=trials_A, p=p_A, observed=success_A)
        obs_B = pm.Binomial("obs_B", n=trials_B, p=p_B, observed=success_B)

        trace = pm.sample(2000, return_inferencedata=True, progressbar=False)

    # Posterior probability that B converts better than A
    prob_b_better = float((trace.posterior["p_B"] > trace.posterior["p_A"]).mean())
    print(f"P(B > A): {prob_b_better:.4f}")

    pm.plot_posterior(trace, var_names=["p_A", "p_B"])
    plt.show()

print("\nRunning Bayesian A/B Testing...")
perform_bayesian_ab_test(int(control.sum()), len(control),
                         int(treatment.sum()), len(treatment))
```
