## Exercise 1 - Calculating Required Sample Size

In [3]:
from statsmodels.stats.power import NormalIndPower 
from statsmodels.stats.proportion import proportion_effectsize

# Inputs
p1 = 0.20
p2 = 0.23
alpha = 0.05
power = 0.8 

# Calculate effect size (Cohen's h for proportions)
effect_size = proportion_effectsize(p1, p2)

# Calculate sample size per group 
analysis = NormalIndPower()
sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power, ratio=1)

print(f"Required sample size per group = {int(sample_size)} users")

Required sample size per group = 2940 users


## Exercise 2 - Understanding The Relationship Between Effect Size And Sample Size

In [None]:
from statsmodels.stats.power import NormalIndPower 
from statsmodels.stats.proportion import proportion_effectsize

# Inputs
alpha = 0.05
power = 0.8 

def sample_size(p1, p2, alpha, power) -> None:
    """Print the per-group sample size needed for an A/B test on a
    proportion metric, given p1, p2, alpha and power."""

    # Calculate effect size (Cohen's h for proportions)
    effect_size = proportion_effectsize(p1, p2)

    # Calculate sample size per group
    analysis = NormalIndPower()
    sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power, ratio=1)
    print(f"Required sample size per group = {int(sample_size)} users\n")

# Calculate sample size for different absolute lifts over the 20% baseline

# +2 percentage points
print("Lift = +2 pp (0.20 -> 0.22)")
sample_size(0.20, 0.22, alpha, power)

# +4 percentage points
print("Lift = +4 pp (0.20 -> 0.24)")
sample_size(0.20, 0.24, alpha, power)

# +5 percentage points
print("Lift = +5 pp (0.20 -> 0.25)")
sample_size(0.20, 0.25, alpha, power)

Lift = +2 pp (0.20 -> 0.22)
Required sample size per group = 6507 users

Lift = +4 pp (0.20 -> 0.24)
Required sample size per group = 1680 users

Lift = +5 pp (0.20 -> 0.25)
Required sample size per group = 1091 users



**How does the sample size change as the effect size increases? Explain why this happens.**


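Answer: The required sample size falls sharply as the effect size grows: 6,507 users per group to detect a change from 0.20 to 0.22, 1,680 for 0.20 to 0.24, and 1,091 for 0.20 to 0.25. A small change is hard to tell apart from random noise, so we need more users to detect it. A larger change stands out more clearly, so fewer users are enough. The required sample size is roughly proportional to 1 / h², where h is the effect size (Cohen's h), so doubling the effect size cuts the sample size to about a quarter.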
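
The inverse-square relationship can be checked with a short sketch. It assumes the standard closed-form approximation n ≈ 2 · ((z₁₋α/₂ + z_power) / h)² for a two-sided test with equal group sizes, rather than the statsmodels solver used in the exercise, so its results may differ slightly from the printed output. `cohens_h` and `approx_n_per_group` are hypothetical helper names.

```python
import math
from statistics import NormalDist


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of the arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def approx_n_per_group(p1: float, p2: float,
                       alpha: float = 0.05, power: float = 0.8) -> float:
    """Approximate per-group sample size (two-sided z-test, equal groups)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for target power
    h = cohens_h(p1, p2)
    return 2 * ((z_alpha + z_power) / h) ** 2


for p2 in (0.22, 0.24, 0.25):
    h = cohens_h(0.20, p2)
    n = approx_n_per_group(0.20, p2)
    # n * h**2 is the same constant in every row, i.e. n is proportional to 1 / h**2
    print(f"p2={p2:.2f}  |h|={abs(h):.4f}  n~{n:.0f}  n*h^2={n * h * h:.3f}")
```

The last column stays constant across rows, which is the 1 / h² scaling described in the answer.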

## Exercise 3 - Exploring The Impact Of Statistical Power

In [14]:
from statsmodels.stats.power import NormalIndPower 
from statsmodels.stats.proportion import proportion_effectsize

# Inputs
alpha = 0.05
p1 = 0.20
p2 = 0.22 

def sample_size(p1, p2, alpha, power):
    """Calculates sample size needed for AB testing for discrete metrics
    based on p1, p2, alpha and power"""
     
    # Calculate effect size
    effect_size = proportion_effectsize(p1, p2)

    # Calculate sample size per group
    analysis = NormalIndPower()
    sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power, ratio=1)
    print(f"Required sample size per group = {int(sample_size)} users\n")

# Calculate required sample size for different power levels 
# 0.7
print("Power = 0.7")
sample_size(p1, p2, alpha, 0.7)
# 0.8
print("Power = 0.8")
sample_size(p1, p2, alpha, 0.8)
# 0.9
print("Power = 0.9")
sample_size(p1, p2, alpha, 0.9)


Power = 0.7
Required sample size per group = 5117 users

Power = 0.8
Required sample size per group = 6507 users

Power = 0.9
Required sample size per group = 8711 users



**Question: How does the required sample size change with different levels of statistical power? Why is this understanding important when designing A/B tests?**

The required sample size grows with statistical power: 5,117 users per group at a power of 0.7, 6,507 at 0.8, and 8,711 at 0.9.
This matters because if we don't have enough users to run the test, we have to compromise on one of the inputs to the power analysis. If we compromise on power, we have to accept that the lower the power, the lower the chance of detecting a real difference between control and variant.
For example, if we lower the power to 0.7 we need fewer users, but when the variant really is better by the assumed amount there is a 30% chance that the test shows no significant difference and we wrongly conclude that the variant isn't better.
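
The trade-off can also be read in the other direction: given the number of users that is actually available, what power does the test achieve? The sketch below assumes the same normal approximation for a two-sided test with equal groups and ignores the negligible opposite tail. `approx_power` is a hypothetical helper, and the group sizes in the loop are illustrative.

```python
import math
from statistics import NormalDist


def approx_power(p1: float, p2: float, n_per_group: float,
                 alpha: float = 0.05) -> float:
    """Approximate power of a two-sided z-test with equal group sizes."""
    # Cohen's h for the two proportions
    h = abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Expected z-statistic under the alternative hypothesis
    shift = h * math.sqrt(n_per_group / 2)
    return NormalDist().cdf(shift - z_alpha)


# Illustrative group sizes for detecting a change from 0.20 to 0.22
for n in (2000, 4000, 6507):
    print(f"n per group = {n}: power ~ {approx_power(0.20, 0.22, n):.2f}")
```

A power well below the planned level means a real improvement of the assumed size would often go undetected.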


## Exercise 4 - Implementing Sequential Testing

You are running an A/B test on two versions of a product page to increase the purchase rate. You plan to monitor the results weekly and stop the test early if one version shows a significant improvement.

- Define your stopping criteria.
- Decide how you would implement sequential testing in this scenario.
- At the end of week three, Version B has a p-value of 0.02. What would you do next?

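Answer:
- Stopping criteria: stop early if the p-value at a weekly checkpoint falls below a pre-specified threshold; otherwise stop at the planned end of the test (week 3). Checking against 0.05 at every look is the simplest rule, but three looks at 0.05 push the overall false-positive rate above 5%, so a stricter per-look threshold (for example a Bonferroni-style 0.05 / 3 ≈ 0.017) is safer.
- Implementation: I'd monitor the test every week:
  - Week 1: compare the conversion rates of Version A and Version B. If the p-value is above the threshold, I continue monitoring. If it is below the threshold, I conclude that the leading version is better and stop the test.
  - Week 2: repeat the same check on the data collected so far, continuing if the p-value is above the threshold and stopping if it is below.
  - Week 3 (assuming this is the last planned week): check the results. If Version B has a significantly higher conversion rate than Version A, I stop the test and conclude that Version B is better. Otherwise I stop the test anyway and let the stakeholders know that Version B didn't perform better than Version A, so we should keep the current version or move on to a different test.
- Week three with a p-value of 0.02: this is below 0.05, so under the unadjusted rule I would stop and conclude that Version B is better. Under the stricter 0.017 per-look threshold it narrowly misses significance, so I would still stop at the planned end but report the result as suggestive rather than conclusive.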
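
Repeated looks at the data change the overall false-positive rate, which a small simulation can illustrate. The sketch below simulates A/A tests (no real difference) with equally sized weekly batches and counts how often any look crosses the threshold. It assumes a normal approximation for the test statistic; `false_positive_rate` is a hypothetical helper and the Bonferroni split is one simple correction among several.

```python
import math
import random
from statistics import NormalDist


def false_positive_rate(per_look_alpha: float, looks: int = 3,
                        trials: int = 20000, seed: int = 0) -> float:
    """Share of simulated A/A tests that cross the threshold at any look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - per_look_alpha / 2)
    hits = 0
    for _ in range(trials):
        cumulative = 0.0
        for look in range(1, looks + 1):
            # Each week adds an equally sized batch; under the null the
            # standardized weekly difference is a standard normal draw.
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / math.sqrt(look)  # z-statistic on all data so far
            if abs(z) > z_crit:
                hits += 1
                break  # the test would have been stopped here
    return hits / trials


print("Unadjusted 0.05 at each of 3 looks:", false_positive_rate(0.05))
print("Bonferroni 0.05 / 3 at each look:  ", false_positive_rate(0.05 / 3))
```

With the unadjusted rule the simulated rate lands well above 5%, while the per-look correction keeps it below 5%.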

## Exercise 5 - Applying Bayesian A/B Testing

You’re testing a new feature in your app, and you want to use a Bayesian approach. Initially, you believe the new feature has a 50% chance of improving user engagement. After collecting data, your analysis suggests a 65% probability that the new feature is better.

- Describe how you would set up your prior belief.
- After collecting data, how does the updated belief (posterior distribution) influence your decision?
- What would you do if the posterior probability was only 55%?

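Answer:
- I would base my prior on previous tests that we ran on similar apps, or on market research about how similar features affected user engagement in other apps from the same industry. Without that kind of evidence, giving both versions the same weak prior encodes the stated 50% chance that the new feature is better.
- A 65% probability that the new feature is better is only 15 percentage points above the 50% prior, and it still leaves a 35% chance that the feature is not better. I think it would be more beneficial to collect more data and reach a more conclusive result before proceeding with either decision.
- At 55% the evidence is barely better than a coin flip, so I'd advise continuing the test to get a more conclusive result.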
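
The probability that the new feature is better can be estimated with a Beta-Binomial model. The sketch below is one way to do it, assuming engagement is recorded as a yes/no outcome per user and both versions share the same Beta prior. `prob_b_beats_a` is a hypothetical helper and the counts are made up for illustration.

```python
import random


def prob_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   prior: tuple = (1.0, 1.0),
                   draws: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta posteriors."""
    rng = random.Random(seed)
    a0, b0 = prior
    wins = 0
    for _ in range(draws):
        # Posterior for each version: Beta(prior + successes, prior + failures)
        rate_a = rng.betavariate(a0 + conv_a, b0 + n_a - conv_a)
        rate_b = rng.betavariate(a0 + conv_b, b0 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


# Before any data: identical priors give roughly a 50% chance that B is better
print("Prior only:     ", prob_b_beats_a(0, 0, 0, 0))
# Hypothetical counts: 200/1000 engaged on A, 210/1000 engaged on B
print("After some data:", prob_b_beats_a(200, 1000, 210, 1000))
```

A decision threshold (for example, ship only above 95%) should be fixed before looking at the data; the value is a business choice, not something the model provides.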


## Exercise 6 - Implementing Adaptive Experimentation

You’re running a test with three different website layouts to increase user engagement. Initially, each layout gets 33% of the traffic. After the first week, Layout C shows higher engagement.

- Explain how you would adjust the traffic allocation after the first week.
- Describe how you would continue to adapt the experiment in the following weeks.
- What challenges might you face with adaptive experimentation, and how would you address them?

Answer:
- After the first week I would send 40% of the traffic to Layout C and split the remaining 60% evenly between Layouts A and B (30% each).
- If Layout C is still outperforming A and B the following week, I would raise its share to 50% and split the rest evenly between A and B (25% each). If it keeps outperforming in later weeks, we can keep increasing the traffic to Layout C and decide whether to implement it together with our stakeholders.
- Adaptive experimentation comes with a few challenges. The estimates can be biased, because the variants are no longer compared on equal terms once traffic shifts between them over time. It is also harder to compute confidence intervals or run standard statistical tests at the end, because the sample sizes and allocations are unequal and keep changing. These challenges can be addressed by:
  - Deciding in advance when and how the allocation will be adapted.
  - Splitting traffic equally among all variations at the beginning (including the weaker ones) and shifting it gradually, so that early noise doesn't drive the allocation.
  - Keeping a small share of traffic (5-10%) allocated equally to all variations throughout the test, since it gives an unbiased estimate of the true effects.
  - Using Bayesian methods.
  - Running a quick traditional A/B test of the winner against control on fresh data when the stakes of implementing the "winner" are high. This confirms the findings with clean statistics.
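
One way to combine a pre-agreed adaptation rule, a guaranteed exploration share, and Bayesian methods is Thompson-style allocation with a traffic floor. This is a swapped-in technique, not the manual 40/30/30 rule from the answer. The sketch assumes engagement is a yes/no outcome per user, `thompson_allocation` is a hypothetical helper, and the weekly counts are made up.

```python
import random


def thompson_allocation(successes: dict, trials: dict, floor: float = 0.10,
                        draws: int = 20000, seed: int = 0) -> dict:
    """Traffic shares: a fixed floor per layout, plus the remainder split by
    each layout's estimated probability of being the best one."""
    names = list(successes)
    if floor * len(names) > 1:
        raise ValueError("floor is too large for the number of layouts")
    rng = random.Random(seed)
    wins = {name: 0 for name in names}
    for _ in range(draws):
        # Draw a plausible engagement rate per layout from its Beta posterior
        # (uniform Beta(1, 1) prior assumed)
        sampled = {
            name: rng.betavariate(1 + successes[name],
                                  1 + trials[name] - successes[name])
            for name in names
        }
        wins[max(sampled, key=sampled.get)] += 1
    remaining = 1.0 - floor * len(names)
    return {name: floor + remaining * wins[name] / draws for name in names}


# Hypothetical week-1 results: engaged users out of 1,000 visitors per layout
week1 = thompson_allocation(
    successes={"A": 100, "B": 105, "C": 130},
    trials={"A": 1000, "B": 1000, "C": 1000},
)
for layout, share in week1.items():
    print(f"Layout {layout}: {share:.0%} of traffic")
```

The floor keeps every layout receiving some traffic, so weaker layouts still produce data for later comparison.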