# AB Tests with Python "free trial" Screener
**Experiment Name**: "Free Trial" Screener.

In [None]:
import math as mt
import numpy as np
import pandas as pd
from scipy.stats import norm

## Current Setup

- Udacity course pages currently show two options: **“Start free trial”** and **“Access course materials.”**
  - **Start free trial:** Students enter credit card info and get a 14-day trial of the paid version. They’re charged automatically after 14 days unless they cancel.  
  - **Access course materials:** Students can view videos and take quizzes for free, but don’t get coaching, feedback, or a certificate.

## The Experiment

- Udacity tested adding a **time commitment question** after clicking “Start free trial.”  
  - If a student said they could spend **5+ hours/week**, they continued to checkout as usual.  
  - If they said **less than 5 hours/week**, they saw a message explaining that Udacity courses usually need more time and suggesting they might prefer the free materials option.  
  - Students could then **either continue with the free trial** or **choose the free materials**.

## Hypothesis

The change would help set realistic expectations and reduce cancellations from students who didn’t have enough time—without significantly lowering the number of paying or completing students.  
This could improve student satisfaction and let coaches focus on learners likely to finish.

## Experiment Details

- **Unit of diversion:** cookie  
- If a student enrolls in a free trial, tracking switches to **user ID** (a user can’t start multiple free trials).  
- Users who don’t enroll aren’t tracked by user ID, even if they were signed in.


## Metric Choice

A successful experiment needs two types of metrics: **Invariant** and **Evaluation** metrics.

- **Invariant metrics**: Should *not* be affected by the experiment. They help verify that randomization worked and groups are comparable.
- **Evaluation metrics**: Measure the impact of the experiment and relate directly to business goals.

Each metric includes a **$D_{min}$** — the minimum meaningful change for the business.  
(Example: a retention increase below 2% may be statistically significant but not practically useful.)

### Invariant Metrics – Sanity Checks

| Metric Name | Formula | $D_{min}$ | Notation |
|--------------|----------|------------|-----------|
| Cookies on course overview page | # unique daily cookies on page | 3000 cookies | $C_k$ |
| Clicks on Free Trial button | # unique daily cookies who clicked | 240 clicks | $C_l$ |
| Free Trial Click-Through Probability | $\frac{C_l}{C_k}$ | 0.01 | $CTP$ |

### Evaluation Metrics – Performance Indicators

| Metric Name | Formula | $D_{min}$ | Notation |
|--------------|----------|------------|-----------|
| **Gross Conversion** | $\frac{enrolled}{C_l}$ | 0.01 | $Conversion_{Gross}$ |
| **Retention** | $\frac{paid}{enrolled}$ | 0.01 | $Retention$ |
| **Net Conversion** | $\frac{paid}{C_l}$ | 0.0075 | $Conversion_{Net}$ |


## Estimating the baseline values of metrics 

Before starting the experiment, we need to know how these metrics behave without the change, that is, their baseline values.

### Collecting estimators data

| Item | Description | Estimator |
|--|--|--|
| Number of cookies | Daily unique cookies to view course overview page | 40,000 |
| Number of clicks | Daily unique cookies to click Free Trial button | 3,200 |
| Number of enrollments | Free Trial enrollments per day | 660 |
| CTP | CTP on Free Trial button | 0.08 |
| Gross Conversion | Probability of enrolling, given a click | 0.20625 |
| Retention | Probability of payment, given enrollment | 0.53 |
| Net Conversion | Probability of payment, given click | 0.109313 |

In [None]:
# Store the baseline estimators in a dictionary for easy access
baseline = {
    "Cookies": 40_000,
    "Clicks": 3_200,
    "Enrollments": 660,
    "CTP": 0.08,
    "GConversion": 0.20625,
    "Retention": 0.53,
    "NConversion": 0.109313
}
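The baseline rates are tied together by the metric formulas, so they can be cross-checked against the raw counts. The following is a minimal sketch using only the values listed in the baseline table; the variable names are chosen here for illustration.

```python
# Baseline counts and retention as listed in the baseline table
cookies = 40_000
clicks = 3_200
enrollments = 660
retention = 0.53

# Derived rates, following the metric formulas
ctp = clicks / cookies                      # clicks / cookies
gross_conversion = enrollments / clicks     # enrolled / clicks
net_conversion = gross_conversion * retention  # (enrolled / clicks) * (paid / enrolled)

print(f"CTP:              {ctp:.4f}")
print(f"Gross Conversion: {gross_conversion:.5f}")
print(f"Net Conversion:   {net_conversion:.6f}")
```

Net Conversion is the product of Gross Conversion and Retention, which is why the three evaluation metrics are not independent of one another.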


### Estimating Standard Deviation

After collecting metric estimates, we calculate the **standard deviation** to use in **sample size** and **confidence interval** calculations.  
A higher variance means it’s harder to detect a statistically significant effect.

Assuming **5,000 cookies** visit the course overview page per day (as stated in the project instructions), we’ll estimate the standard deviation for the **evaluation metrics only**.  
This sample size is smaller than the total population but large enough to form two comparison groups.

### Scaling Collected Data

Before calculating variance, we need to **scale** our collected metric counts to match the sample size used for variance estimation.  
In this case, we scale from **40,000 unique cookies per day** (original data) down to **5,000 cookies per day**.


In [None]:
# Scale counts from 40,000 cookies to 5,000 cookies
scale_factor = 5_000 / baseline["Cookies"]

baseline["Cookies"] = 5_000
baseline["Clicks"] = baseline["Clicks"] * scale_factor
baseline["Enrollments"] = baseline["Enrollments"] * scale_factor

baseline


### Estimating Analytically

To estimate variance analytically, we assume metrics that represent probabilities $(\hat{p})$ follow a **binomial distribution**.  
The standard deviation can then be calculated using:

$$
SD = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
$$

This assumption holds only when the **unit of diversion** (how users are split) matches the **unit of analysis** (the denominator in the metric formula).  

### What Happens if Units Differ?

Suppose the **unit of diversion** is a user ID, but a single user may click multiple times:

- User A clicks 3 times → potentially 3 “trials” in the denominator.  
- User B clicks once → 1 trial.  

Outcomes from the same user are **correlated** (if User A doesn’t enroll once, they probably won’t enroll on other clicks).

Mathematically:

- **Independent Bernoulli trials:**  
$
Var(X) = n p (1-p)
$

- **Correlated trials:**  
$
Var(X) > n p (1-p)
$ (assuming positive correlation between trials from the same user)

**Why?** Because repeated measures from the same user aren’t adding as much independent information. The effective sample size is smaller than the count of trials. The actual variance may vary, and it’s better to estimate it **empirically** using collected data.
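The effect of correlated trials can be illustrated with a small simulation. This is only a sketch under strong assumptions: every user clicks exactly 3 times, and in the correlated case all clicks from one user share the same outcome (perfect within-user correlation). The function name and parameter values are hypothetical, and the standard library is used so the block runs on its own.

```python
import math
import random
import statistics

def simulate_rate_sd(n_users=200, clicks_per_user=3, p=0.2, n_sims=1000, seed=0):
    """Compare the analytical SD of a rate with simulated SDs.

    Returns (analytical_sd, independent_sd, correlated_sd), where the
    denominator of the rate is always the total number of clicks.
    """
    rng = random.Random(seed)
    n_trials = n_users * clicks_per_user
    indep_rates, corr_rates = [], []
    for _ in range(n_sims):
        # Independent case: every click is its own Bernoulli trial
        indep_successes = sum(rng.random() < p for _ in range(n_trials))
        # Correlated case: one Bernoulli draw per user, repeated for each click
        converting_users = sum(rng.random() < p for _ in range(n_users))
        indep_rates.append(indep_successes / n_trials)
        corr_rates.append(converting_users * clicks_per_user / n_trials)
    analytical = math.sqrt(p * (1 - p) / n_trials)
    return analytical, statistics.stdev(indep_rates), statistics.stdev(corr_rates)

analytical_sd, independent_sd, correlated_sd = simulate_rate_sd()
print(f"analytical:  {analytical_sd:.4f}")
print(f"independent: {independent_sd:.4f}")
print(f"correlated:  {correlated_sd:.4f}")
```

Under these assumptions the independent simulation should land close to the analytical value, while the correlated one is noticeably larger, because the effective sample size is the number of users rather than the number of clicks.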

**Gross Conversion** – The baseline represents the probability of enrollment given a click.

> In this case, the **unit of diversion** (cookies) — how users are split between control and experiment — is the same as the **unit of analysis** (cookies who click), which is the denominator in the Gross Conversion formula.  

Because these units match, the **analytical estimate of variance** is valid and sufficient.


In [None]:
# Compute parameters for Gross Conversion (GC)
GC = {}

# Minimum detectable effect (if needed later)
GC["d_min"] = 0.01

# Probability of conversion (given or calculated as enrollments/clicks)
GC["p"] = baseline["GConversion"]

# Number of trials (clicks)
GC["n"] = baseline["Clicks"]

# Standard deviation for a proportion: sqrt(p * (1 - p) / n), rounded to 4 decimals
GC["sd"] = round(mt.sqrt(GC["p"] * (1 - GC["p"]) / GC["n"]), 4)

GC


**Retention** - The baseline is the probability of payment, given enrollment. The sample size is the number of enrolled users. 

>In this case, the unit of diversion (cookies) differs from the unit of analysis (users who enrolled), so the analytical estimate alone may not be accurate.

If we had the data for these estimates, we would want to estimate this variance empirically as well.

In [None]:
# Compute parameters for Retention (R)
R = {}

# Minimum detectable effect
R["d_min"] = 0.01

# Probability of retention
R["p"] = baseline["Retention"]

# Number of trials (enrollments)
R["n"] = baseline["Enrollments"]

# Standard deviation for a proportion: sqrt(p * (1 - p) / n), rounded to 4 decimals
R["sd"] = round(mt.sqrt(R["p"] * (1 - R["p"]) / R["n"]), 4)

R


**Net Conversion** - The baseline is the probability of payment, given a click. The sample size is the number of cookies that clicked. 
> In this case, the unit of analysis and diversion are equal so we expect a good enough estimation analytically.


In [None]:
# Compute parameters for Net Conversion (NC)
NC = {}

# Minimum detectable effect
NC["d_min"] = 0.0075

# Probability of net conversion
NC["p"] = baseline["NConversion"]

# Number of trials (clicks)
NC["n"] = baseline["Clicks"]

# Standard deviation for a proportion: sqrt(p * (1 - p) / n), rounded to 4 decimals
NC["sd"] = round(mt.sqrt(NC["p"] * (1 - NC["p"]) / NC["n"]), 4)

NC
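As a quick cross-check, the three analytical standard deviations can be computed in one place. This is a minimal sketch that uses only the baseline values and the 5,000-cookie scaling described earlier; `proportion_sd` is a helper name introduced here for illustration.

```python
import math

def proportion_sd(p, n):
    """Analytical SD of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1 - p) / n)

# Scale the daily counts from 40,000 cookies down to 5,000 cookies
scale = 5_000 / 40_000
clicks = 3_200 * scale        # 400 clicks
enrollments = 660 * scale     # 82.5 enrollments

sds = {
    "Gross Conversion": proportion_sd(0.20625, clicks),
    "Retention": proportion_sd(0.53, enrollments),
    "Net Conversion": proportion_sd(0.109313, clicks),
}

for name, sd in sds.items():
    print(f"{name}: {sd:.4f}")
```

Retention has the largest standard deviation because its denominator (enrollments) is much smaller than the number of clicks.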


# Experiment Sizing

Given $\alpha = 0.05$ (significance level) and $\beta = 0.2$ (Type II error rate, i.e. power $1-\beta = 0.8$), we want to estimate how many total pageviews (cookies who viewed the course overview page) are needed in the experiment. This total will be divided into the two groups: control and experiment.

The minimum sample size per group that provides **Type I error** probability $\alpha$, **power** $1-\beta$, **detectable effect** $d$, and **baseline conversion rate** $p$ for testing the simple hypothesis  

$$
H_0: P_{exp} - P_{cont} = 0
$$

against the simple alternative  

$$
H_A: P_{exp} - P_{cont} = d
$$

is:

$$
n = \frac{\left(Z_{1-\frac{\alpha}{2}} sd_1 + Z_{1-\beta} sd_2 \right)^2}{d^2}
$$

where

$$
sd_1 = \sqrt{2 p (1-p)}, \quad sd_2 = \sqrt{p (1-p) + (p+d)(1-(p+d))}
$$


# Sample Size Formula for Two-Sample Proportions

Suppose we want to compare two proportions:

- Control group proportion: $p_1 = p$  
- Experiment group proportion: $p_2 = p + d$  
- Detectable difference: $d = p_2 - p_1$  

We aim for:

- Significance level: $\alpha$ (Type I error)  
- Power: $1 - \beta$ (probability of detecting a true effect)  

---

## 1. Test Statistic

For a two-sided z-test comparing proportions, the test statistic is:

$$
Z = \frac{\hat{p}_2 - \hat{p}_1}{\sqrt{\frac{p_1 (1-p_1)}{n} + \frac{p_2 (1-p_2)}{n}}}
$$

where $n$ is the sample size per group.  

- **Under $H_0$**: $p_1 = p_2 = p$

$$
SD_0 = \sqrt{\frac{p(1-p)}{n} + \frac{p(1-p)}{n}} = \sqrt{\frac{2 p(1-p)}{n}}
$$

- **Under $H_A$**: $p_1 = p, p_2 = p+d$

$$
SD_A = \sqrt{\frac{p(1-p) + (p+d)(1-(p+d))}{n}}
$$

---

## 2. Critical Values for Significance and Power

- Two-sided test significance $\alpha$:

$$
Z_{\text{crit}} = Z_{1-\frac{\alpha}{2}}
$$

- To achieve power $1-\beta$, we require:

$$
P(\text{Reject } H_0 \mid H_A) = 1 - \beta
$$

---

## 3. Relating Detectable Difference to z-Scores

Let $X = \hat{p}_2 - \hat{p}_1$, so that its mean under $H_A$ is $d = p_2 - p_1$. Under $H_A$, 

$$
X \sim N(d, SD_A^2)
$$

Standardizing:

$$
P\left( \frac{X - d}{SD_A} > \frac{Z_{\text{crit}} SD_0 - d}{SD_A} \right) = 1 - \beta
$$

By the definition of the standard normal quantile:

$$
\frac{Z_{\text{crit}} SD_0 - d}{SD_A} = -Z_{1-\beta}
$$

Rearranging gives:

$$
d = Z_{1-\frac{\alpha}{2}} SD_0 + Z_{1-\beta} SD_A
$$

Substitute $SD_0$ and $SD_A$:

$$
d = Z_{1-\frac{\alpha}{2}} \sqrt{\frac{2 p(1-p)}{n}} + Z_{1-\beta} \sqrt{\frac{p(1-p) + (p+d)(1-(p+d))}{n}}
$$

Factor out $1/\sqrt{n}$:

$$
d = \frac{1}{\sqrt{n}} \Bigg( Z_{1-\frac{\alpha}{2}} \sqrt{2 p(1-p)} + Z_{1-\beta} \sqrt{p(1-p) + (p+d)(1-(p+d))} \Bigg)
$$

---

## 4. Solve for Sample Size $n$

$$
\sqrt{n} = \frac{Z_{1-\frac{\alpha}{2}} \sqrt{2 p(1-p)} + Z_{1-\beta} \sqrt{p(1-p) + (p+d)(1-(p+d))}}{d}
$$

$$
\boxed{
n = \frac{\Big( Z_{1-\frac{\alpha}{2}} \sqrt{2 p(1-p)} + Z_{1-\beta} \sqrt{p(1-p) + (p+d)(1-(p+d))} \Big)^2}{d^2}
}
$$

This gives the **required sample size per group** to detect a difference $d$ with significance level $\alpha$ and power $1-\beta$.


We have all the inputs we need: the Type I error rate $(\alpha)$, power $(1-\beta)$, detectable change $(d = D_{min})$, and the baseline conversion rate $\hat{p}$. What remains to calculate:

- The z-scores for $1-\frac{\alpha}{2}$ and for $1-\beta$.

- The two standard deviations, $sd_1$ and $sd_2$: one for the baseline rate and one for the expected changed rate.

In [None]:
from scipy.stats import norm
import math as mt

# Function to get the z-score for a given cumulative probability
# Input: prob (cumulative probability, e.g. 1 - alpha/2 or 1 - beta)
# Returns: corresponding z-score (standard normal quantile)
def get_z_score(prob):
    return norm.ppf(prob)


# Function to compute standard deviations for baseline and expected change
# Inputs:
#   p : baseline conversion rate
#   d : minimum detectable change
# Returns: list [sd_baseline, sd_expected_change]
def get_sds(p, d):
    sd_baseline = mt.sqrt(2 * p * (1 - p))           # SD for baseline
    sd_expected = mt.sqrt(p * (1 - p) + (p + d) * (1 - (p + d)))  # SD for expected change
    return [sd_baseline, sd_expected]


# Function to calculate minimum sample size per group
# Inputs:
#   sds   : list of standard deviations [baseline_sd, expected_sd]
#   alpha : significance level
#   beta  : type II error rate
#   d     : minimum detectable effect
# Returns: minimum sample size per group
def get_sampSize(sds, alpha, beta, d):
    n = ((get_z_score(1 - alpha / 2) * sds[0] + get_z_score(1 - beta) * sds[1]) ** 2) / (d ** 2)
    return n
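One way to sanity-check the sample size formula is to plug the resulting $n$ back into the power calculation from the derivation and confirm that it returns $1-\beta$. The sketch below does this for the Gross Conversion inputs. It uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm` so that it runs on its own; `sample_size` and `achieved_power` are names introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution

def sample_size(p, d, alpha=0.05, beta=0.2):
    """Per-group sample size from the boxed formula (not rounded)."""
    sd_null = sqrt(2 * p * (1 - p))
    sd_alt = sqrt(p * (1 - p) + (p + d) * (1 - (p + d)))
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(1 - beta)
    return (z_alpha * sd_null + z_beta * sd_alt) ** 2 / d ** 2

def achieved_power(p, d, n, alpha=0.05):
    """Upper-tail rejection probability under H_A for a per-group size n.

    As in the derivation, the (negligible) lower rejection tail is ignored.
    """
    sd_null = sqrt(2 * p * (1 - p) / n)
    sd_alt = sqrt((p * (1 - p) + (p + d) * (1 - (p + d))) / n)
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 1 - STD_NORMAL.cdf((z_alpha * sd_null - d) / sd_alt)

n = sample_size(p=0.20625, d=0.01)
power = achieved_power(p=0.20625, d=0.01, n=n)
print(f"n per group: {round(n)}, achieved power: {power:.3f}")
```

Because the formula was obtained by solving the power equation for $n$, the achieved power matches the target up to floating-point error.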


## Calculate Sample Size per Metric

### Gross Conversion


In [None]:
GC["d"] = 0.01
R["d"] = 0.01
NC["d"] = 0.0075

In [None]:
# Calculate sample size per group for Gross Conversion
sds_gc = get_sds(GC["p"], GC["d_min"])
GC["SampSize"] = round(get_sampSize(sds_gc, alpha=0.05, beta=0.2, d=GC["d_min"]))

GC["SampSize"]


- This means we need at least 25,835 cookies who click the Free Trial button, per group.
- With a click-through probability of 0.08 (400 clicks out of 5,000 pageviews), that requires `GC["SampSize"] / 0.08 = 322,938` pageviews, again per group.
- Finally, the total number of pageviews required for the Gross Conversion metric is:

In [None]:
# Adjust sample size to account for Click-Through Probability (CTP) and two groups
# GC["SampSize"] originally gives number of clicks needed per group
# Divide by CTP to get the number of visitors needed
# Multiply by 2 to account for both control and experiment groups
GC["SampSize"] = round(GC["SampSize"] / baseline["CTP"] * 2)

GC["SampSize"]


### Retention

In [None]:
# Calculate sample size per group for Retention
sds_r = get_sds(R["p"], R["d_min"])
R["SampSize"] = round(get_sampSize(sds_r, alpha=0.05, beta=0.2, d=R["d_min"]))

R["SampSize"]


This means that we need 39,087 enrolled users per group. We first convert this to cookies who clicked, then to cookies who viewed the page, and finally multiply by two to cover both groups.

In [None]:
# Adjust sample size to account for CTP, GC, and two groups
# R["SampSize"] is originally the number of enrollments per group
# Divide by CTP and GC to get the number of visitors
# Multiply by 2 for control + experiment groups
R["SampSize"] = round(R["SampSize"] / (baseline["CTP"] * baseline["GConversion"]) * 2)

R["SampSize"]


This comes to more than 4.7 million pageviews in total. That is impractical: at about 40,000 pageviews a day, data collection would take well over 100 days. We therefore drop Retention as an evaluation metric, because an experiment of the size we can actually run would be underpowered to detect a change of $D_{min}$ in it.
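The conversion from a per-group sample size to total pageviews follows the same pattern for every metric: divide by each funnel rate between pageviews and the unit of analysis, then multiply by the number of groups. A minimal sketch, with `pageviews_required` as a hypothetical helper name and the baseline rates taken from the table above:

```python
import math

def pageviews_required(units_per_group, rates, groups=2):
    """Convert a per-group sample size into total pageviews.

    `rates` lists the funnel rates leading from pageviews to the unit of
    analysis, e.g. [CTP] for clicks or [CTP, gross conversion] for enrollments.
    """
    pageviews_per_group = units_per_group
    for rate in rates:
        pageviews_per_group /= rate
    return round(pageviews_per_group * groups)

# Retention: 39,087 enrollments per group, reached via clicks and enrollments
retention_pageviews = pageviews_required(39_087, rates=[0.08, 0.20625])

# Days of data collection if all 40,000 daily pageviews were used
days_at_full_traffic = math.ceil(retention_pageviews / 40_000)

print(retention_pageviews, days_at_full_traffic)
```

Each extra step in the funnel multiplies the pageview requirement, which is why a metric measured on enrollments is so much more expensive than one measured on clicks.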

### Net Conversion

In [None]:
# Calculate sample size per group for Net Conversion
sds_nc = get_sds(NC["p"], NC["d_min"])
NC["SampSize"] = round(get_sampSize(sds_nc, alpha=0.05, beta=0.2, d=NC["d_min"]))

NC["SampSize"]


So, needing 27,413 cookies who click per group takes us all the way up to:

In [None]:
# Adjust sample size to account for CTP and two groups
# NC["SampSize"] originally gives the number of clicks per group
# Divide by CTP to get the number of visitors
# Multiply by 2 to account for both control and experiment groups
NC["SampSize"] = round(NC["SampSize"] / baseline["CTP"] * 2)

NC["SampSize"]


That brings us to 685,325 cookies who view the page. This is more than Gross Conversion requires, so it sets the size of the experiment. Assuming we divert 80% of each day's pageviews to the experiment (about 32,000 of the 40,000 daily pageviews), the data collection period will be about 3 weeks.
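The duration estimate is simple arithmetic, sketched below for a few traffic fractions. The function name and the alternative fractions (50% and 100%) are illustrative assumptions; only the 80% case comes from the text.

```python
import math

def days_needed(total_pageviews, daily_pageviews=40_000, traffic_fraction=0.8):
    """Whole days required to collect `total_pageviews` at the given traffic share."""
    return math.ceil(total_pageviews / (daily_pageviews * traffic_fraction))

for fraction in (0.5, 0.8, 1.0):
    days = days_needed(685_325, traffic_fraction=fraction)
    print(f"{fraction:.0%} of traffic: {days} days")
```

Diverting less traffic lowers the exposure of users to an untested change but lengthens the experiment, so the fraction is a trade-off between risk and speed.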

# Analyzing Collected Data

In [None]:
# we use pandas to load datasets
control = pd.read_csv("control_data.csv")
experiment = pd.read_csv("experiment_data.csv")
control.head()

In [None]:
experiment.head()

### Sanity Checks

We have 3 invariant metrics:

- Number of Cookies in Course Overview Page
- Number of Clicks on Free Trial Button
- Free Trial button Click-Through-Probability (CTP)


In [None]:
# A significant difference would imply a biased experiment
# whose results we should not rely on.

pageviews_cont = control['Pageviews'].sum()
pageviews_exp = experiment['Pageviews'].sum()
pageviews_total = pageviews_cont + pageviews_exp

print("Number of pageviews in control:", pageviews_cont)
print("Number of pageviews in experiment:", pageviews_exp)
print("Total pageviews in both groups:", pageviews_total)

Ok, these numbers look pretty close. Now let's check that this difference is not significant and is random, as we expected. We can model this variation as follows:

We expect the number of pageviews in the **control group** to be about half (50%) of the total pageviews in both groups. We can define a random variable to describe this.

A **binomial random variable** represents the number of successes in N experiments, given the probability of a single success. If we treat being assigned to the control group as a "success" with probability 0.5 (random!), then the number of samples assigned to the control group is the value of this binomial variable.

Thanks to the **Central Limit Theorem**, for large N the proportion of samples assigned to control (the binomial count divided by N) is approximately normally distributed, with:

- **Mean:** $\mu = p$
- **Standard deviation:** $\sigma = \sqrt{\frac{p(1-p)}{N}}$

$$
X \sim N\left(p, \sqrt{\frac{p(1-p)}{N}}\right)
$$


What we want to test is whether our observed $\hat{p}$ (number of samples in control divided by total number of samples in both groups) is not significantly different from $p=0.5$. 

To do that, we calculate the acceptable margin of error at a 95% confidence level:

$$
ME = Z_{1-\frac{\alpha}{2}}SD
$$
and check that $\hat{p}$ falls inside the confidence interval around the expected value $p$:
$$
CI = [p-ME,\ p+ME]
$$

In [None]:
# Given values
p = 0.5                       # assumed population proportion under null
alpha = 0.05                  # significance level (for 95% CI)

# Compute sample proportion
p_hat = round(pageviews_cont / pageviews_total, 4)

# Compute standard deviation of the sampling distribution
sd = mt.sqrt(p * (1 - p) / pageviews_total)

# Compute margin of error
ME = round(get_z_score(1 - alpha / 2) * sd, 4)

# Display results
lower = round(p - ME, 4)
upper = round(p + ME, 4)

print(f"The confidence interval is between {lower} and {upper}; "
      f"Is {p_hat} inside this range? {'Yes' if lower <= p_hat <= upper else 'No'}")


Our observed  $\hat{p}$ is inside this range which means the difference in number of samples between groups is expected. So far so good, since this invariant metric sanity test passes!
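Since the same check is applied to both pageviews and clicks, it can be wrapped in a small helper. The sketch below is one possible version. `check_even_split` is a hypothetical name, the counts in the usage example are made up for illustration, and `statistics.NormalDist` stands in for `scipy.stats.norm` so the block runs on its own.

```python
from math import sqrt
from statistics import NormalDist

def check_even_split(n_cont, n_exp, alpha=0.05, p=0.5):
    """Check whether the control share of a count is consistent with p."""
    total = n_cont + n_exp
    p_hat = n_cont / total
    sd = sqrt(p * (1 - p) / total)
    me = NormalDist().inv_cdf(1 - alpha / 2) * sd
    lower, upper = p - me, p + me
    return {"p_hat": p_hat, "lower": lower, "upper": upper,
            "passed": lower <= p_hat <= upper}

# Hypothetical counts, for illustration only
balanced = check_even_split(5_000, 5_050)
skewed = check_even_split(5_000, 6_000)
print(balanced["passed"], skewed["passed"])
```

A failed check would suggest a problem with the diversion or the tracking rather than an effect of the experiment itself.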

In [None]:
# Calculate total clicks
clicks_cont = control["Clicks"].sum()
clicks_exp = experiment["Clicks"].sum()
clicks_total = clicks_cont + clicks_exp

# Compute observed proportion of clicks in control
p_hat = round(clicks_cont / clicks_total, 4)

# Expected proportion under H0 (equal allocation between groups)
p = 0.5
alpha = 0.05

# Standard deviation for a proportion
sd = mt.sqrt(p * (1 - p) / clicks_total)

# Margin of error for 95% confidence interval
ME = round(get_z_score(1 - (alpha / 2)) * sd, 4)

# Confidence interval
lower = round(p - ME, 4)
upper = round(p + ME, 4)

# Print results
print(f"The confidence interval is between {lower} and {upper}; "
      f"Is {p_hat} inside this range? {'Yes' if lower <= p_hat <= upper else 'No'}")


We have another pass! Great, so far it still seems all is well with our experiment results. Now, for the final metric which is a probability.

### Sanity Checks for Differences Between Probabilities

**Click-through Probability of the Free Trial Button**

We want to ensure that the proportion of clicks given a pageview (our observed **CTP**) is about the same in both groups.  
To check this, we calculate the **CTP** in each group and then compute a **confidence interval** for the expected difference between them.

We expect to see no difference:

$$
CTP_{exp} - CTP_{cont} = 0
$$

with an acceptable margin of error.  
The key adjustment here is in the calculation of the **standard error**, which uses a **pooled standard error**:

$$
SD_{pool} = \sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})\left(\frac{1}{N_{cont}}+\frac{1}{N_{exp}}\right)}
$$

where

$$
\hat{p}_{pool} = \frac{x_{cont}+x_{exp}}{N_{cont}+N_{exp}}
$$

In [None]:
# Compute click-through probabilities
ctp_cont = clicks_cont / pageviews_cont
ctp_exp = clicks_exp / pageviews_exp

# Observed difference
d_hat = round(ctp_exp - ctp_cont, 4)

# Pooled click-through probability
p_pooled = clicks_total / pageviews_total

# Pooled standard deviation
sd_pooled = mt.sqrt(p_pooled * (1 - p_pooled) *
                    ((1 / pageviews_cont) + (1 / pageviews_exp)))

# Margin of error for 95% confidence interval
alpha = 0.05
ME = round(get_z_score(1 - (alpha / 2)) * sd_pooled, 4)

# Print results
print(f"The confidence interval is between {-ME:.4f} and {ME:.4f}; "
      f"Is {d_hat:.4f} within this range? {'Yes' if -ME <= d_hat <= ME else 'No'}")


Wonderful. It seems this test has passed with flying colors as well.
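The pooled-standard-error calculation used for the click-through probability applies to any difference between two proportions, so it can be sketched as a reusable helper. `pooled_diff_ci` is a name introduced here, the counts in the example are hypothetical, and `statistics.NormalDist` replaces `scipy.stats.norm` to keep the block standalone.

```python
from math import sqrt
from statistics import NormalDist

def pooled_diff_ci(x_cont, n_cont, x_exp, n_exp, alpha=0.05):
    """Return (d_hat, margin_of_error) for the difference exp - cont."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    sd_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    me = NormalDist().inv_cdf(1 - alpha / 2) * sd_pool
    d_hat = x_exp / n_exp - x_cont / n_cont
    return d_hat, me

# Hypothetical counts, for illustration only
d_hat, me = pooled_diff_ci(x_cont=200, n_cont=1_000, x_exp=150, n_exp=1_000)
print(f"d_hat = {d_hat:.4f}, CI = [{d_hat - me:.4f}, {d_hat + me:.4f}]")
```

For a sanity check the interval is centred on 0 and the observed difference should fall inside it. For an effect-size estimate the interval is centred on the observed difference instead.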

### Examining Effect Size

The next step is to examine the changes between the control and experiment groups with respect to our evaluation metrics.  
We want to ensure that the observed difference:

1. **Exists** — there is a measurable difference between groups.  
2. **Is statistically significant** — the difference is unlikely due to random chance.  
3. **Is practically significant** — the difference is large enough to make the experimental change beneficial to the company.

> **Note:**  
> A metric is **statistically significant** if the confidence interval does **not include 0**  
> (meaning you can be confident there was a change).  
>  
> A metric is **practically significant** if the confidence interval does **not include the practical significance boundary**  
> (meaning you can be confident the change is large enough to matter to the business).

In [None]:
# Count total clicks from complete records only
clicks_cont = control.loc[control["Enrollments"].notnull(), "Clicks"].sum()
clicks_exp = experiment.loc[experiment["Enrollments"].notnull(), "Clicks"].sum()

In [None]:
# Gross Conversion - Enrollments divided by Clicks
enrollments_cont = control["Enrollments"].sum()
enrollments_exp = experiment["Enrollments"].sum()

GC_cont = enrollments_cont / clicks_cont
GC_exp = enrollments_exp / clicks_exp

# Pooled Gross Conversion rate
GC_pooled = (enrollments_cont + enrollments_exp) / (clicks_cont + clicks_exp)

# Pooled standard deviation
GC_sd_pooled = mt.sqrt(GC_pooled * (1 - GC_pooled) *
                       ((1 / clicks_cont) + (1 / clicks_exp)))

# Margin of error for 95% CI
GC_ME = round(get_z_score(1 - alpha/2) * GC_sd_pooled, 4)

# Observed difference
GC_diff = round(GC_exp - GC_cont, 4)

# Results
print(f"The change due to the experiment is {GC_diff*100:.2f}%")
print(f"Confidence Interval: [{GC_diff - GC_ME:.4f}, {GC_diff + GC_ME:.4f}]")
print("The change is statistically significant if the CI doesn't include 0.")
print(f"In that case, it is practically significant if {-GC['d_min']} is not in the CI as well.")


According to this result, the experiment caused a change that was both **statistically** and **practically significant**.  

We observed a **decrease of 2.06 percentage points**, which exceeds the practical significance threshold of 1 percentage point ($D_{min} = 0.01$). This means the **Gross Conversion rate** of the experiment group (those exposed to the change, i.e., asked how many hours they can devote to studying) **decreased by about 2 percentage points**.  

In practical terms, this indicates that **fewer people enrolled in the Free Trial** after seeing the pop-up.


In [None]:
# Net Conversion - Payments divided by Clicks
payments_cont = control["Payments"].sum()
payments_exp = experiment["Payments"].sum()

NC_cont = payments_cont / clicks_cont
NC_exp = payments_exp / clicks_exp

# Pooled Net Conversion rate
NC_pooled = (payments_cont + payments_exp) / (clicks_cont + clicks_exp)

# Pooled standard deviation
NC_sd_pooled = mt.sqrt(NC_pooled * (1 - NC_pooled) *
                       ((1 / clicks_cont) + (1 / clicks_exp)))

# Margin of error for 95% CI
NC_ME = round(get_z_score(1 - alpha/2) * NC_sd_pooled, 4)

# Observed difference
NC_diff = round(NC_exp - NC_cont, 4)

# Results
print(f"The change due to the experiment is {NC_diff*100:.2f}%")
print(f"Confidence Interval: [{NC_diff - NC_ME:.4f}, {NC_diff + NC_ME:.4f}]")
print("The change is statistically significant if the CI doesn't include 0.")
print(f"In that case, it is practically significant if {NC['d_min']} is not in the CI as well.")


In this case the change is a decrease of less than 0.5 percentage points. It is not statistically significant, and therefore not practically significant either.
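The two decision rules applied to Gross Conversion and Net Conversion can be written down explicitly. The sketch below treats a change as statistically significant when the confidence interval excludes 0, and as practically significant when the whole interval lies beyond $\pm D_{min}$. The function name and the interval values in the examples are hypothetical and are not results from this experiment.

```python
def assess_significance(ci_lower, ci_upper, d_min):
    """Return (statistically_significant, practically_significant)."""
    # Statistically significant: the interval does not contain 0
    statistically = not (ci_lower <= 0 <= ci_upper)
    # Practically significant: the whole interval lies beyond the
    # practical significance boundary on one side
    practically = ci_upper < -d_min or ci_lower > d_min
    return statistically, practically

# Hypothetical intervals, for illustration only
clear_drop = assess_significance(-0.030, -0.012, d_min=0.01)
small_drop = assess_significance(-0.015, -0.005, d_min=0.01)
no_change = assess_significance(-0.012, 0.002, d_min=0.0075)
print(clear_drop, small_drop, no_change)
```

The middle case shows why both checks are needed: an interval can exclude 0 and still overlap the practical significance boundary.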

## Double Check with Sign Tests

A **sign test** provides another perspective on our results.  
It checks whether the trend of change we observed (increase or decrease) is consistently evident in the **daily data**.

In [None]:
# Merge control and experiment datasets side by side
full = control.join(
    other=experiment,
    how="inner",           # keep only rows present in both datasets
    lsuffix="_cont",       # suffix for control columns
    rsuffix="_exp"         # suffix for experiment columns
)

# Inspect the first few rows
full.head()

In [None]:
full.count()

In [None]:
# Keep only complete records
full = full.loc[full["Enrollments_cont"].notnull()]

# Count remaining rows per column
full.count()

In [None]:
# Gross Conversion (GC) daily comparison
x = full['Enrollments_cont'] / full['Clicks_cont']
y = full['Enrollments_exp'] / full['Clicks_exp']
full['GC'] = np.where(y > x, 1, 0)  # 1 if experiment GC > control GC

# Net Conversion (NC) daily comparison
z = full['Payments_cont'] / full['Clicks_cont']
w = full['Payments_exp'] / full['Clicks_exp']
full['NC'] = np.where(w > z, 1, 0)  # 1 if experiment NC > control NC

# Inspect first few rows
full.head()

In [None]:
# Count the number of days where experiment outperformed control
GC_x = full.GC[full["GC"] == 1].count()
NC_x = full.NC[full["NC"] == 1].count()

# Total number of observations
n = full.NC.count()

# Print results
print(f"""
No. of cases where experiment GC > control GC: {GC_x}
No. of cases where experiment NC > control NC: {NC_x}
Total number of cases: {n}
""")


### Building a Sign Test

After counting the number of days in which the **experiment group** had a higher metric value than the **control group**, we want to determine if that number is likely to occur by **random chance** in a new experiment (i.e., test for significance).  

We assume that the chance of a day like this is **50%**, and then use the **binomial distribution** with $p=0.5$ and the number of days $n$ to calculate the probability of observing this many “successful” days by random chance.

According to the binomial distribution:

$$
p(\text{successes}) = \frac{n!}{x!(n-x)!} p^x (1-p)^{n-x}
$$

where:  
- $n$ = total number of days  
- $x$ = number of days with a higher metric in the experiment  
- $p = 0.5$  

Because we are doing a **two-tailed test**, we sum the probabilities of the observed count and of every more extreme count (one tail), then double that sum to get the **p-value**. We then compare the p-value to our significance level $\alpha$:  
- If the p-value $> \alpha$, the result is **not significant**  
- If the p-value $\leq \alpha$, the result is **significant**  

**Recall:** A p-value is the probability of observing a test statistic as extreme or more extreme than the one observed.  

For example, if we observe 2 such days (fewer than half of the $n$ days), the two-sided p-value is twice the lower-tail probability:

$$
p\text{-value} = 2\,P(x \leq 2) = 2\,\big[p(0) + p(1) + p(2)\big]
$$

In [None]:
# Probability of exactly x successes out of n trials (p=0.5)
def get_prob(x, n):
    # mt.comb gives the binomial coefficient exactly; no rounding here,
    # so small tail terms are not lost when they are summed
    return mt.comb(n, x) * 0.5**n

# Two-sided p-value for observing x or fewer successes
# (lower tail doubled, so this assumes x <= n/2; capped at 1)
def get_2side_pvalue(x, n):
    p = 0
    for i in range(0, x + 1):
        p += get_prob(i, n)
    return round(min(1.0, 2 * p), 4)
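The helper above doubles the lower tail, which is appropriate when the number of "successful" days is below $n/2$. A possible symmetric variant, sketched here under the same $p = 0.5$ assumption, folds the count onto the smaller tail so it also handles counts above $n/2$. `sign_test_pvalue` is a name introduced for this example.

```python
from math import comb

def sign_test_pvalue(x, n):
    """Two-sided sign-test p-value for x successes in n days under p = 0.5."""
    k = min(x, n - x)  # fold onto the smaller tail (the distribution is symmetric)
    tail = sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return min(1.0, 2 * tail)  # doubling can exceed 1 near n/2, so cap it

# Illustrative counts: 2 and 8 successes out of 10 days are equally extreme
print(sign_test_pvalue(2, 10))
print(sign_test_pvalue(8, 10))
```

With $p = 0.5$ the binomial distribution is symmetric, so 2 successes and 8 successes out of 10 yield the same p-value.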


Finally, to conduct the **sign test** itself, we calculate the **p-value** for each metric using the counts `GC_x`, `NC_x`, and `n`, along with the function `get_2side_pvalue`.

In [None]:
# Check statistical significance for GC and NC
print("GC Change is significant if", get_2side_pvalue(GC_x, n), "is smaller than 0.05")
print("NC Change is significant if", get_2side_pvalue(NC_x, n), "is smaller than 0.05")

We get the same conclusions as we got from our effect size calculation: the change in Gross conversion was indeed significant, while the change in Net conversion was not.

# Conclusions & Recommendations

At this point, having observed that the underlying goal—**increasing the fraction of paying users by asking them in advance if they have time to invest in the course**—was not achieved, our recommendation is to **not continue with this change**.  

While the experiment may have caused a change in **Gross Conversion**, it did **not improve Net Conversion**.