# Case Study: Navigating Vanguard's Digital Redesign

Note: This is one of many possible solutions. This should serve as a guide.

As we saw in Part I, we need to deal with duplicates and null values.

Ideally, we would have saved that code in functions, and here we would just call the functions, or we would have saved the clean datasets either as CSVs or as pickles.

Since data cleaning is not our main focus and there is little cleaning code, we simply repeat it below, right after reading the datasets.
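As a minimal sketch of the "save it in a function" idea mentioned above, the cleaning steps could be wrapped as follows. The function name `clean_demo` and the toy data are hypothetical and only illustrate the pattern; they are not part of the original Part I code.

```python
import pandas as pd

def clean_demo(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows and fill missing gender with the most common value."""
    out = df.drop_duplicates().copy()
    out["gendr"] = out["gendr"].fillna(out["gendr"].mode()[0])
    return out

# Toy stand-in for df_demo (values are made up for illustration)
toy_demo = pd.DataFrame({
    "client_id": [1, 1, 2, 3],
    "gendr": ["F", "F", None, "F"],
})
toy_clean = clean_demo(toy_demo)

# The cleaned frame could then be stored for reuse, for example:
# toy_clean.to_pickle("df_demo_clean.pkl")
```

With a helper like this, later notebooks could call the function (or load the saved file) without repeating the cleaning code.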

In [None]:
import pandas as pd

path = 'https://github.com/data-bootcamp-v4/lessons/raw/main/6_inf_stats/files_for_lessons/'

# Correctly parsing the dataframes
df_demo = pd.read_csv(path+'df_final_demo.txt', sep=",")
df_experiment_clients = pd.read_csv(path+'df_final_experiment_clients.txt', sep=",")
df_web_data_pt_1 = pd.read_csv(path+'df_final_web_data_pt_1.txt', sep=",")
df_web_data_pt_2 = pd.read_csv(path+'df_final_web_data_pt_2.txt', sep=",")

# Combining web data
df_web_data = pd.concat([df_web_data_pt_1, df_web_data_pt_2], ignore_index=True)

In [None]:
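# Count exact duplicate rows in each dataframe
duplicates = {
    "df_demo": df_demo.duplicated().sum(),
    "df_experiment_clients": df_experiment_clients.duplicated().sum(),
    "df_web_data": df_web_data.duplicated().sum(),
}

# If there are duplicates, we will drop them
if duplicates["df_demo"] > 0:
    df_demo.drop_duplicates(inplace=True)
if duplicates["df_experiment_clients"] > 0:
    df_experiment_clients.drop_duplicates(inplace=True)
if duplicates["df_web_data"] > 0:
    df_web_data.drop_duplicates(inplace=True)

# Cleaning data: fill missing gender values with the most frequent value
df_demo['gendr'] = df_demo['gendr'].fillna(df_demo['gendr'].mode()[0])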


## **1. Client Behavior Analysis**

### **Engagement Demographics**

- Who are the primary clients using this online process? Are they younger or older, new or long-standing clients? Do a client behaviour analysis to answer any relevant questions you might think of.

#### Age

In [None]:
import matplotlib.pyplot as plt

# Plot the age distribution
plt.figure(figsize=(10, 5))
plt.hist(df_demo['clnt_age'], bins=20, edgecolor='k', alpha=0.7)
plt.xlabel('Client Age')
plt.ylabel('Number of Clients')
plt.title('Distribution of Client Ages')
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.show()


The age distribution reveals that the majority of clients are between the ages of 20 and 70, with peaks in the 30s and between roughly 45 and 65.

#### Tenure

Let's examine the distribution of client tenure to determine if they are new or long-standing clients.

In [None]:
# Plot the client tenure distribution
plt.figure(figsize=(10, 5))
plt.hist(df_demo['clnt_tenure_yr'], bins=20, edgecolor='k', alpha=0.7)
plt.xlabel('Client Tenure (in years)')
plt.ylabel('Number of Clients')
plt.title('Distribution of Client Tenure')
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.show()


The client tenure distribution indicates that a significant portion of clients have been with the platform for less than 10 years, with a peak around 5-7 years of tenure. This suggests that there are many relatively newer clients.



## 2. Performance Metrics

- **Success Indicators**:
    - What key performance indicators (KPIs) will determine the success of the new design? (e.g., completion rate, time spent on each step, error rates)
- **Redesign Outcome**:
    - Based on the chosen KPIs, how does the new design's performance compare to the old one?

To determine the success of the new design, we need to identify key performance indicators (KPIs). Given the data, potential KPIs include:

- **Completion Rate**: The proportion of users who reach the final 'confirm' step.
- **Time Spent on Each Step**: The average duration users spend on each step before moving to the next.
- **Error Rates**: If there's a step where users often go back to a previous step, it may indicate confusion or an error.

**Redesign Outcome:**

After selecting the KPIs, we'll calculate these metrics for both the Test and Control groups to compare the performance of the new design against the old one.

### Completion Rate

Let's start by calculating the **Completion Rate** for both the Test and Control groups.

- For each group (Test and Control), let's calculate the number of users who reached the 'confirm' step and divide it by the total number of users in that group.
- This gives us the proportion (or probability) of users completing the process.
- Mathematically, for the Test group, the completion rate is given by:

$$
\text{Completion Rate (Test)} = \frac{\text{Number of 'Test' users reaching 'confirm'}}{\text{Total 'Test' users}}
$$

- The same formula applies for the Control group. This is a direct application of probability where we determine the likelihood of an event (completion) occurring.
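To make the formula concrete, here is a small sketch on made-up data that counts each client at most once in the numerator. The toy dataframe `events` is hypothetical; only the column names mirror the case-study data.

```python
import pandas as pd

# Toy event log: client 1 reaches 'confirm' twice
events = pd.DataFrame({
    "client_id":    [1, 1, 1, 2, 3, 3, 4],
    "Variation":    ["Test", "Test", "Test", "Test", "Control", "Control", "Control"],
    "process_step": ["start", "confirm", "confirm", "start", "start", "confirm", "start"],
})

# Denominator: unique clients per group
clients_per_group = events.groupby("Variation")["client_id"].nunique()

# Numerator: unique clients per group with at least one 'confirm' event
confirmed_per_group = (
    events[events["process_step"] == "confirm"]
    .groupby("Variation")["client_id"]
    .nunique()
)
toy_completion = (confirmed_per_group / clients_per_group).fillna(0)

# For comparison: counting 'confirm' rows instead of unique clients
toy_event_based = (
    events[events["process_step"] == "confirm"]["Variation"].value_counts()
    / clients_per_group
)
```

In this toy example the client-based rate for Test is 0.5, while the row-based rate is 1.0 because one client confirmed twice. Which definition is appropriate depends on how "completion" is defined for the analysis.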

In [None]:
# Merge the web data with the experiment clients data to know which group each client belongs to
df_web_experiment_merged = df_web_data.merge(df_experiment_clients, on='client_id', how='left')

# Keep only the rows where the process step is 'confirm'
df_confirmations = df_web_experiment_merged[df_web_experiment_merged['process_step'] == 'confirm']

# Calculate completion rate for both Test and Control groups
# Note: the numerator counts 'confirm' events (rows), so a client who confirms more than once is counted each time
completion_rates = df_confirmations['Variation'].value_counts() / df_experiment_clients['Variation'].value_counts()

completion_rates


The calculated completion rates for both the Test and Control groups are:

- **Test Group (New Design)**: Approximately 94.92%
- **Control Group (Old Design)**: Approximately 73.66%

This indicates that users exposed to the new design were more likely to reach the final 'confirm' step compared to users exposed to the old design.

### Time Spent on Each Step

Next, let's calculate the **Time Spent on Each Step**. We'll determine the time difference between each step for each visit and then calculate the average duration users spend on each step before moving to the next.

The result will provide insights into the average time users of both the Test (new design) and Control (old design) groups spend on each of the process steps.


- For each visit, we calculate the time difference between consecutive steps.
- We then average these time differences for each step across all visits.
- This does not directly use probability, but averages (or means) to understand typical user behavior.
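The steps above can be sketched on a tiny, made-up event log. The dataframe `visits` and its timestamps are hypothetical; the column names follow the case-study data.

```python
import pandas as pd

visits = pd.DataFrame({
    "visit_id": ["a", "a", "a", "b", "b"],
    "process_step": ["start", "step_1", "step_2", "start", "step_1"],
    "date_time": pd.to_datetime([
        "2017-04-01 10:00:00", "2017-04-01 10:01:00", "2017-04-01 10:03:00",
        "2017-04-01 11:00:00", "2017-04-01 11:00:30",
    ]),
}).sort_values(["visit_id", "date_time"])

# Time elapsed since the previous event in the same visit (NaT for the first event)
visits["time_diff"] = visits.groupby("visit_id")["date_time"].diff()

# Mean elapsed time, in seconds, grouped by the step the user arrived at
toy_avg_seconds = visits.groupby("process_step")["time_diff"].mean().dt.total_seconds()
```

Note that `diff()` stores the elapsed time on the row of the step the user arrived at, so the value for `step_1` is the time taken to get from the previous event to `step_1`. If the time should instead be attributed to the step being left, the differences would need to be shifted by one row within each visit.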

In [None]:
# Convert the 'date_time' column to datetime format
df_web_experiment_merged['date_time'] = pd.to_datetime(df_web_experiment_merged['date_time'])

# Sort the data by client_id, visit_id, and date_time to ensure steps are in order
df_web_experiment_merged = df_web_experiment_merged.sort_values(by=['client_id', 'visit_id', 'date_time'])

# Calculate the time difference between each step for each visit
df_web_experiment_merged['time_diff'] = df_web_experiment_merged.groupby(['client_id', 'visit_id'])['date_time'].diff()

# Calculate the average duration users spend on each step for both Test and Control groups
average_time_per_step = df_web_experiment_merged.groupby(['Variation', 'process_step'])['time_diff'].mean()

# Convert the time difference to minutes for easier interpretation
average_time_per_step = average_time_per_step.dt.total_seconds() / 60

# Reset index for better presentation
average_time_per_step = average_time_per_step.reset_index()

average_time_per_step


The results show the average time (in minutes) users spend on each step for both the Test and Control groups.

### Error Rates

To address the **Error Rates KPI**, one approach is to identify instances where users go back to a previous step, suggesting possible confusion or an error.

Let's calculate the error rates by identifying these instances. We'll consider moving from a later step to an earlier one as an error.


- We identify instances where users moved from a later step to an earlier one (indicating possible confusion or errors).
- For each group, the error rate is calculated as the proportion of these "error" instances to the total number of steps taken.
- This is another application of probability, where we determine the likelihood of an error occurring.
- Mathematically, for the Test group, the error rate is given by:

$$
\text{Error Rate (Test)} = \frac{\text{Number of 'backward' steps taken by 'Test' users}}{\text{Total steps taken by 'Test' users}}
$$

- The same formula applies for the Control group.
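As a quick illustration of the error-rate formula, the following sketch applies it to a made-up log with two visits. The dataframe `toy` is hypothetical and its rows are assumed to be in chronological order within each visit.

```python
import pandas as pd

step_order = {"start": 0, "step_1": 1, "step_2": 2, "step_3": 3, "confirm": 4}

# Visit 'a' goes start -> step_1 -> start -> step_1 (one backward move);
# visit 'b' only moves forward
toy = pd.DataFrame({
    "visit_id": ["a"] * 4 + ["b"] * 3,
    "Variation": ["Test"] * 4 + ["Control"] * 3,
    "process_step": ["start", "step_1", "start", "step_1", "start", "step_1", "step_2"],
})
toy["step_value"] = toy["process_step"].map(step_order)

# A negative difference means the user moved to an earlier step
toy["is_backward"] = toy.groupby("visit_id")["step_value"].diff() < 0

# Mean of a boolean column = share of backward steps among all recorded steps
toy_error_rates = toy.groupby("Variation")["is_backward"].mean()
```

Here the Test visit has one backward move out of four recorded steps (0.25) and the Control visit has none (0.0).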



In [None]:
# We identify instances where users moved from a later step to an earlier one 
# (indicating possible confusion or errors).

# Assign a numerical value to each process step for easier comparison
step_mapping = {'start': 0, 'step_1': 1, 'step_2': 2, 'step_3': 3, 'confirm': 4}
df_web_experiment_merged['step_value'] = df_web_experiment_merged['process_step'].map(step_mapping)

# Calculate the step difference for each consecutive action within a visit
df_web_experiment_merged['step_diff'] = df_web_experiment_merged.groupby(['client_id', 'visit_id'])['step_value'].diff()

# Identify errors where the step difference is negative (i.e., moving from a later step to an earlier one)
df_web_experiment_merged['is_error'] = df_web_experiment_merged['step_diff'] < 0

# Calculate error rates for both Test and Control groups
error_rates = df_web_experiment_merged.groupby('Variation')['is_error'].mean()

error_rates

The calculated error rates for both the Test and Control groups are:

- **Test Group (New Design)**: Approximately 9.1%
- **Control Group (Old Design)**: Approximately 6.8%

This suggests that users exposed to the new design experienced a slightly higher rate of errors (or instances where they went back to a previous step) compared to users exposed to the old design.

**Note**: The 'is_error' column is a boolean indicator where `True` (treated as `1`) means an error (backward step) occurred, and `False` (treated as `0`) means no error.

When you take the mean of a binary column:
- The numerator is the sum of `1`s (which represents the number of 'backward' steps).
- The denominator is the total number of rows (or the total number of steps taken by the group).

This aligns with the provided formula above. The `.mean()` method on a binary column in pandas calculates the proportion of `1`s to the total, which in this case is the error rate.

**Summary**:
1. The **Completion Rate** for the new design is higher than the old design, indicating better user engagement or clarity with the new design.
2. The **Time Spent on Each Step** is a valuable metric for understanding user engagement and potential bottlenecks in the process.
3. The **Error Rates** suggest that users of the new design had a slightly higher likelihood of going back to a previous step, possibly indicating confusion or hesitation.

Based on these KPIs, the new design **seems** to improve completion rates but may introduce some points of confusion that cause users to revisit previous steps. Further analysis or user feedback might be needed to understand the reasons behind these behaviors and refine the design accordingly.
Furthermore, we need to conduct hypothesis testing to make data-driven conclusions about the effectiveness of the redesign.



## **3. Hypothesis Testing**

As part of your analysis, you'll conduct hypothesis testing to make data-driven conclusions about the effectiveness of the redesign.

### Completion rate

#### Before Conducting a Hypothesis Test

- **Step 1: Define the Metric and Threshold.** 
    - In our case, the primary metric is the completion rate. There is no threshold.

- **Step 2: Compute the Observed Completion Rates for Test and Control Groups**
    - We'll first calculate the completion rates for both the Test and Control groups using unique clients. This was already done above.

- **Step 3: Determine the Observed Difference**
    - After calculating the completion rates for both groups, we'll determine the observed difference between them to see if the new design (Test group) resulted in an improvement. This was done above.

#### 1. **State the Hypothesis**

In our case, the primary metric is the completion rate. Since the new design (Test group) had a higher completion rate compared to the old design (Control group), we might be interested in confirming if this difference is statistically significant.

**Hypothesis**:
- **Null Hypothesis ($H_0$)**: The completion rate for the Test group (new design) is equal to the completion rate for the Control group (old design).
- **Alternative Hypothesis ($H_a$)**: The completion rate for the Test group (new design) is not equal to the completion rate for the Control group (old design).

To test this hypothesis, we will use a two-proportion z-test. This test is appropriate when comparing proportions (like completion rates) between two groups.


#### 2. **Choose the Right Statistical Test**

Given that we are comparing proportions between two groups, a two-proportion z-test is appropriate.
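For intuition, here is a minimal from-scratch sketch of a two-sided two-proportion z-test with a pooled standard error. The helper name `two_proportion_ztest` and the counts are hypothetical; this is meant to show the arithmetic, not to replace a library implementation.

```python
from math import erfc, sqrt

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 == p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Made-up counts: 60 of 100 successes versus 45 of 100
z_toy, p_toy = two_proportion_ztest(60, 100, 45, 100)
```

A positive z-statistic means the first group's observed proportion is higher than the second's.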

#### 3. Test Assumptions

The two-proportion *z-test* has certain assumptions that need to be met. Let's check these assumptions for our data:

1. **Random Sampling**: The data should be randomly sampled, which means every member of the population has an equal chance of being included in the sample. 
    - This assumption often relies on the method of data collection. We don't have explicit information about the sampling method from the provided data, so we'll have to rely on external context. Typically, in experiments like these, users are randomly assigned to Test or Control groups, so this assumption might be reasonable.

2. **Large Sample Size**: Both sample sizes should be large enough. We'll check this below.
   
3. **Independence**: The samples should be independent of each other. If sampling without replacement, the sample size should not be more than 10% of the population to ensure independence.
    
    - Again, this often depends on the data collection method. If users were randomly assigned to Test or Control groups, their experiences and outcomes would be independent of each other. However, without explicit information, this is an assumption we're making.

Given the checks we've performed and typical practices in A/B testing, it seems reasonable to proceed with the two-proportion *z-test*, but always with the understanding that our conclusions are as good as the assumptions behind the test.



##### Large Sample Size check - rule of thumb

The **Central Limit Theorem (CLT)** tells us that, for large sample sizes, the sampling distribution of the sample proportion is approximately normally distributed, regardless of the distribution of the underlying data.

While we don't **check the normality** of our raw data in the same way we might for other tests (e.g., using a **Shapiro-Wilk** test or a **QQ plot**, since data is not continuous), we do ensure that our sample size and proportions are such that the **sampling distribution of the proportion is approximately normal**.

In the case of the binomial distribution (which is the underlying distribution for proportions), we can check whether it can be approximated by a normal distribution by verifying these conditions for each group:

$np \geq 5$ and $n(1-p) \geq 5$

where $n$ is the sample size of the group and $p$ is its proportion of successes. The check is done separately for the Test and Control groups.


These conditions are commonly taught rules of thumb for binomial distributions in introductory statistics courses.
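A small sketch of this rule of thumb, written in terms of counts: since $np$ is the number of successes and $n(1-p)$ is the number of failures, the check reduces to requiring at least 5 of each. The helper name `normal_approx_ok` is hypothetical.

```python
def normal_approx_ok(successes, n, minimum=5):
    """Rule-of-thumb check: np >= minimum and n(1 - p) >= minimum."""
    failures = n - successes
    return successes >= minimum and failures >= minimum

# Made-up examples
balanced = normal_approx_ok(60, 100)        # 60 successes, 40 failures
too_few_successes = normal_approx_ok(2, 100)
too_few_failures = normal_approx_ok(98, 100)
```

The rule can fail at either extreme: with very few successes or with very few failures.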


In [None]:
# Determine the number of users and successes for each group
grouped_data = df_web_experiment_merged.groupby('Variation').agg(
    total_users=('client_id', 'nunique'),
    successes=('process_step', lambda x: (x == 'confirm').sum())
).reset_index()

# Calculate the proportion of successes
grouped_data['p'] = grouped_data['successes'] / grouped_data['total_users']

# Calculate the conditions np>=5 and n(1-p)>=5
grouped_data['np'] = grouped_data['total_users'] * grouped_data['p']
grouped_data['n(1-p)'] = grouped_data['total_users'] * (1 - grouped_data['p'])

grouped_data[['Variation', 'total_users', 'successes', 'p', 'np', 'n(1-p)']]



The conditions $np \geq 5$ and $n(1-p) \geq 5$ for both the Test and Control groups have been calculated.

Both groups meet the conditions, suggesting that the sample sizes are sufficiently large. This means we can reasonably proceed with the two-proportion z-test, under the assumption that the sampling distribution of the sample proportion is approximately normally distributed.

#### 4. Conduct the test

Let's conduct the hypothesis test.

In [None]:
from statsmodels.stats.proportion import proportions_ztest

# Fix the group order so successes and sample sizes line up
groups = ['Test', 'Control']

# Number of successes (completions) for each group
count = df_confirmations['Variation'].value_counts().reindex(groups)

# Number of trials (total users) for each group
nobs = df_experiment_clients['Variation'].value_counts().reindex(groups)

# Conduct two-proportion z-test
z_stat, p_value = proportions_ztest(count, nobs)

z_stat, p_value

The results from the two-proportion z-test are as follows:

- **Z-statistic**: 66.77
- **p-value**: 0.0



#### 5. **Interpretation and decision**

- Given the very low p-value (essentially 0), we can **reject the null hypothesis $H_0$** at any conventional significance level (e.g., $\alpha = 0.05$). This means there is statistically significant evidence to suggest that the completion rate for the Test group (new design) is different from the Control group (old design).

- Given our previous calculations, we know the **completion rate for the Test group is higher**. Therefore, our analysis supports the claim that the new design leads to a significantly higher completion rate compared to the old design.

- Given the statistically significant improvement in the completion rate, Vanguard **should implement the new UI design** as it has proven to be statistically better.

### Completion rate with a Cost-Effectiveness Threshold

The introduction of a new UI design comes with its associated costs: design, development, testing, potential training for staff, and possible short-term disruptions or adjustments for users. To justify these costs, Vanguard has determined that any new design should lead to a minimum increase in the completion rate to be deemed cost-effective.

**Threshold**: Vanguard has set this minimum increase in completion rate at **5%**. This is the rate at which the projected benefits, in terms of increased user engagement and potential revenue, are estimated to outweigh the costs of the new design.

Do another analysis, ensuring that the observed increase in completion rate from the A/B test meets or exceeds this **5%** threshold. If the new design doesn't lead to at least this level of improvement, it may not be justifiable from a cost perspective, regardless of its statistical significance.

#### Before Conducting a Hypothesis Test

- **Step 1: Define the Metric and Threshold.** 
    - In our case, the primary metric is the completion rate. The minimum increase in completion rate that Vanguard deems necessary to justify the costs associated with the new UI design is 5%.

- **Step 2: Compute the Observed Completion Rates for Test and Control Groups**
    - We'll first calculate the completion rates for both the Test and Control groups using unique clients. This was already done in the test above.

- **Step 3: Determine the Observed Difference**
    - After calculating the completion rates for both groups, we'll determine the observed difference between them to see if the new design (Test group) resulted in an improvement.

In [None]:
# Calculate the observed completion rates for Control and Test groups
p_control_observed = completion_rates['Control']
p_test_observed = completion_rates['Test']

# Calculate the observed difference in completion rates
observed_difference = p_test_observed - p_control_observed

# Check if the observed difference meets the 5% threshold
meets_threshold = observed_difference >= 0.05

observed_difference, meets_threshold


The observed difference in completion rates between the Test group (with the new design) and the Control group (with the old design) is approximately 21%, which is substantially higher than Vanguard's threshold of 5%.

However, an observed difference above the threshold could still be due to chance, so we should test whether the improvement exceeds 5% by a statistically significant margin. Only then can we say that the new design is justified from a cost perspective, and not merely a genuine improvement.

Let's conduct the two-proportion z-test again to determine the statistical significance of the observed difference.

#### 1. **State the Hypothesis**

Given the goal of the experiment and the 5% threshold:

**Null Hypothesis ($H_0$):** The completion rate for the Test group (new design) is equal to or less than the completion rate for the Control group (old design) plus 5 percentage points.

**Alternative Hypothesis ($H_a$):** The completion rate for the Test group (new design) is greater than the completion rate for the Control group (old design) plus 5 percentage points.

#### 2. **Choose the Right Statistical Test**

Given that we are comparing proportions between two groups, a **one-sided** two-proportion z-test is appropriate.

#### 3. **Check for Assumptions**

This is the same as what we did in the test above.


#### 4. **Conduct the Test**

We've already performed the test earlier, but given the new threshold, we'll compare the completion rate of the Test group to the completion rate of the Control group increased by 5%.
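One way to express "Test exceeds Control by more than 5 percentage points" directly is to subtract the margin in the numerator of the z-statistic. The sketch below does this from scratch with an unpooled standard error; the helper name `ztest_with_margin` and the counts are hypothetical, and this is an alternative formulation, not the approach used in the cell that follows.

```python
from math import erfc, sqrt

def ztest_with_margin(x1, n1, x2, n2, margin):
    """One-sided z-test of H0: p1 - p2 <= margin against Ha: p1 - p2 > margin."""
    p1, p2 = x1 / n1, x2 / n2
    # Unpooled standard error: under this H0 the two proportions are not assumed equal
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2 - margin) / se
    p_value = 0.5 * erfc(z / sqrt(2))  # upper tail: 1 - Phi(z)
    return z, p_value

# Made-up counts: 70% versus 60% completion, tested against a 5-point margin
z_margin, p_margin = ztest_with_margin(700, 1000, 600, 1000, margin=0.05)
```

statsmodels' `proportions_ztest` also accepts a `value` argument for the difference under the null hypothesis, which may be a more direct option than adjusting counts by hand; its documentation describes how the variance is estimated in that case.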



In [None]:
# As in the previous test:
# Number of successes (completions) for each group
count = df_confirmations['Variation'].value_counts()

# Number of trials (total users) for each group
nobs = df_experiment_clients['Variation'].value_counts()

# Adjust the control proportion by adding the 5% threshold,
# then convert it back to an adjusted number of successes for Control
adjusted_p_control = completion_rates['Control'] + 0.05
n_test, n_control = nobs['Test'], nobs['Control']
count_adjusted = [count['Test'], int(adjusted_p_control * n_control)]

# Conduct the two-proportion z-test using the adjusted control count
# (groups are listed explicitly as [Test, Control] so counts and sizes line up)
z_stat_adjusted, p_value_adjusted = proportions_ztest(count_adjusted, [n_test, n_control], alternative='larger')  # one-sided test

z_stat_adjusted, p_value_adjusted


#### 5. **Interpret Results and Decision**

- Given the extremely low p-value (essentially 0) and the nature of the test being one-sided, we can reject the null hypothesis ($H_0$) at any conventional significance level (e.g., $\alpha = 0.05$). This confirms there's statistically significant evidence to suggest that the completion rate for the Test group (with the new design) is greater than the completion rate for the Control group (with the old design) plus 5 percentage points.

- Given the statistically significant improvement in the completion rate and the fact that this improvement meets Vanguard's cost-effectiveness threshold, Vanguard **should implement the new UI design** as it not only has proven to be statistically better but also meets the practical significance set by the company in terms of cost justification.


### Other Hypothesis Examples

Choose other hypotheses you want to test. For example, you might want to test whether the average age of clients engaging with the new process is the same as that of clients engaging with the old process; whether the average client tenure (how long they've been with Vanguard) is the same in both groups; or whether there are gender differences in engaging with the new or old process.

#### Average age of clients

Let's test the hypothesis regarding the **average age of clients** between the two groups (Test and Control).

##### **Hypothesis**

1. **Null Hypothesis ($H_0$)**: The average age of clients for the Test group (new design) is equal to the average age of clients for the Control group (old design).
2. **Alternative Hypothesis ($H_a$)**: The average age of clients for the Test group (new design) is not equal to the average age of clients for the Control group (old design).

##### Statistical Test

To test this hypothesis, we will use a two-sample t-test, which is appropriate for comparing means between two independent groups.

##### Test assumptions

The assumptions for the independent two-sample t-test (which we will apply here) include:

1. **Independence of Observations**: The two groups (Test and Control) are independent of each other.
2. **Normality**: The dependent variable should be approximately normally distributed in each group. This can be checked using plots (like Q-Q plots) or tests (like the Shapiro-Wilk test).
3. **Homogeneity of Variances**: The variances of the dependent variable should be equal in the two groups. This can be checked using Levene's test. If the variances are not equal, we can still conduct the t-test by adjusting for unequal variances, using the `equal_var=False` parameter.
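To show what adjusting for unequal variances means in practice, here is a from-scratch sketch of Welch's t-statistic and its approximate degrees of freedom. The helper name `welch_t` and the age values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Sample variances (n - 1 in the denominator), each divided by its sample size
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

# Made-up ages for two small groups
t_toy, df_toy = welch_t([50.0, 52.0, 54.0, 56.0, 58.0], [45.0, 47.0, 49.0, 51.0, 53.0])
```

Each group keeps its own variance estimate instead of a single pooled one, which is why the test still works when the variances differ. When the two groups have equal variances and sizes, as in this toy example, the result matches the pooled t-test.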

Let's check the assumptions of normality and homogeneity of variances.


1. **Normality Assumption**:

   

It's worth noting that the t-test is robust against violations of this assumption when sample sizes are large, as in our case.
With a smaller sample we could run a Shapiro-Wilk test as follows (for N > 5000, SciPy warns that the p-value may not be accurate):
```python
# Check for normality using Shapiro-Wilk test for both groups
shapiro_test = stats.shapiro(ages_test)
shapiro_control = stats.shapiro(ages_control)

shapiro_test, shapiro_control
```



2. **Homogeneity of Variances Assumption**:


In [None]:
from scipy import stats

# Now merge the data to get the age of clients for each variation
# Note: df_web_experiment_merged has one row per web event, so a client's age appears once per event
df_full_merged = pd.merge(df_web_experiment_merged, df_demo, on='client_id', how='inner')

# Extract the client ages for the Test and Control groups (dropping missing ages)
ages_test = df_full_merged[df_full_merged['Variation'] == 'Test']['clnt_age'].dropna()
ages_control = df_full_merged[df_full_merged['Variation'] == 'Control']['clnt_age'].dropna()

# Check for homogeneity of variances using Levene's test
levene_result = stats.levene(ages_test, ages_control)
levene_result

- Levene's test provides a p-value of 0.71, meaning we fail to reject the null hypothesis of equal variances. Thus, we can assume that the variances between the two groups are approximately equal.


Given these results:
- The t-test is known to be robust against violations of normality assumption, especially with large sample sizes.
- We have met the assumption of equal variances.

Considering the large sample size and the central limit theorem (which suggests that the sampling distribution of the mean will be approximately normal for large samples regardless of the underlying distribution), we can proceed with the t-test.


##### Run test, interpret results and make a decision

In [None]:
# Extract ages for Test and Control groups using the correct column name
ages_test = df_full_merged[df_full_merged['Variation'] == 'Test']['clnt_age'].dropna()
ages_control = df_full_merged[df_full_merged['Variation'] == 'Control']['clnt_age'].dropna()

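# Conduct a two-sample t-test
# equal_var=False runs Welch's t-test, which does not rely on equal variances (a conservative choice even though Levene's test did not reject equality)
t_stat, p_value_age = stats.ttest_ind(ages_test, ages_control, equal_var=False)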

t_stat, p_value_age

The results of the two-sample t-test for the average age of clients engaging with the new design versus the old design are as follows:

- **T-Statistic**: 7.83
- **P-Value**: $4.71 \times 10^{-15}$

Given the extremely low p-value, we can reject the null hypothesis ($H_0$) at any conventional significance level (e.g., $\alpha = 0.05$). This means there is statistically significant evidence to suggest that the average age of clients engaging with the new design (Test group) is different from those engaging with the old design (Control group).

The positive t-statistic indicates that the average age of clients in the Test group is higher than that in the Control group.

Therefore, this analysis suggests that the new design might be more appealing or user-friendly to an older demographic compared to the old design.

**Decision**:

If Vanguard's goal was to make the new design more attractive to an older demographic, then this is a positive outcome. However, if the aim was a universal appeal or targeting a younger demographic, then the design might need further adjustments. The next steps would depend on Vanguard's business objectives and target audience for the platform.

#### Average client tenure

Let's explore the hypothesis regarding the **average client tenure** between the two groups (Test and Control). This can provide insights into whether newer clients react differently to the design compared to longer-standing clients. 

**Hypothesis:**
1. **Null Hypothesis ($H_0$)**: The average client tenure for the Test group (new design) is equal to the average client tenure for the Control group (old design).
2. **Alternative Hypothesis ($H_a$)**: The average client tenure for the Test group (new design) is not equal to the average client tenure for the Control group (old design).

To test this hypothesis, we will use a two-sample t-test since we are comparing the means of two independent groups. 

Let's proceed with this analysis.


In [None]:
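# Build a client-level table with demographics and group assignment
# (assumed definition: one row per client, keeping only clients assigned to a group)
df_demo_experiment_merged = df_demo.merge(df_experiment_clients, on='client_id', how='inner').dropna(subset=['Variation'])

# Calculate the average client tenure (in years) for both Test and Control groups
avg_tenure_summary = df_demo_experiment_merged.groupby('Variation')['clnt_tenure_yr'].mean()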

avg_tenure_summary

Now, we'll conduct the two-sample t-test to compare the means of the client tenures between the Test and Control groups. This test will help us determine if there's a statistically significant difference between the average client tenures of the two groups.

In [None]:
from scipy.stats import ttest_ind

# Extract the client tenures for the Test and Control groups (dropping missing values)
test_tenure = df_demo_experiment_merged[df_demo_experiment_merged['Variation'] == 'Test']['clnt_tenure_yr'].dropna()
control_tenure = df_demo_experiment_merged[df_demo_experiment_merged['Variation'] == 'Control']['clnt_tenure_yr'].dropna()

# Conduct the two-sample t-test
t_stat, p_value = ttest_ind(test_tenure, control_tenure)

t_stat, p_value

The results of the two-sample t-test are as follows:

- $ t $-statistic: -1.7121
- $ p $-value: 0.0869

Given the $p$-value of 0.0869, which is greater than the typical significance level of 0.05, we fail to reject the null hypothesis ($H_0$). This means that we don't have enough statistical evidence to claim that the average client tenure for the Test group is different from that of the Control group.

In conclusion, based on this test, there's no significant difference in the average client tenure between the Test and Control groups.

#### Gender Differences in Engaging with the New or Old Process

Let's test a different hypothesis: **Gender Differences in Engaging with the New or Old Process**.

**Hypothesis:**
1. **Null Hypothesis ($H_0$)**: The proportion of female clients engaging with the new design is the same as the proportion of female clients engaging with the old design.
2. **Alternative Hypothesis ($H_a$)**: The proportion of female clients engaging with the new design is different from the proportion of female clients engaging with the old design.

We'll proceed with this analysis, looking at gender differences between the two groups.



To test the given hypothesis about gender differences between the two groups, we'll be using a test of proportions. Specifically, we can use the two-proportion z-test.

Here's the plan:

1. Extract the number of female clients in the Test group and the Control group.
2. Extract the total number of clients in both groups.
3. Use the two-proportion z-test to compare the proportions.

Let's start by extracting the necessary data.

In [None]:
# Extract the number of female clients in the Test and Control groups
female_test = df_demo_experiment_merged[(df_demo_experiment_merged['Variation'] == 'Test') & (df_demo_experiment_merged['gendr'] == 'F')].shape[0]
female_control = df_demo_experiment_merged[(df_demo_experiment_merged['Variation'] == 'Control') & (df_demo_experiment_merged['gendr'] == 'F')].shape[0]

# Extract the total number of clients in the Test and Control groups
total_test = df_demo_experiment_merged[df_demo_experiment_merged['Variation'] == 'Test'].shape[0]
total_control = df_demo_experiment_merged[df_demo_experiment_merged['Variation'] == 'Control'].shape[0]

female_test, female_control, total_test, total_control


Now, let's perform the two-proportion z-test to compare the proportions of female clients engaging with the new design versus the old design.

In [None]:
from statsmodels.stats.proportion import proportions_ztest

# Define the counts of successes (female clients) and the total number of trials
count = [female_test, female_control]
nobs = [total_test, total_control]

# Perform the two-proportion z-test
z_stat, p_value = proportions_ztest(count, nobs)

z_stat, p_value




The results of the two-proportion z-test are as follows:

- $z$-statistic: 0.63
- $p$-value: 0.52

Given that the $p$-value is greater than the typical significance level of 0.05, we fail to reject the null hypothesis ($H_0$). This means that we don't have enough statistical evidence to claim that the proportion of female clients engaging with the new design is different from the proportion of female clients engaging with the old design.

In conclusion, based on this test, there's no significant gender difference in terms of engaging with the new or old process.
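The z-test above compares one gender category at a time. As a complementary sketch, a chi-square test of independence can compare the whole gender distribution across the two groups in a single step. The small DataFrame below is made-up illustration data, not the Vanguard dataset; only the column names `Variation` and `gendr` mirror the ones used above.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data standing in for df_demo_experiment_merged
toy = pd.DataFrame({
    'Variation': ['Test'] * 100 + ['Control'] * 100,
    'gendr': ['F'] * 30 + ['M'] * 35 + ['U'] * 35      # Test group
           + ['F'] * 33 + ['M'] * 33 + ['U'] * 34,     # Control group
})

# Contingency table: rows are groups, columns are gender categories
contingency = pd.crosstab(toy['Variation'], toy['gendr'])

# Chi-square test of independence between group and gender
chi2, p_value, dof, expected = chi2_contingency(contingency)

chi2, p_value, dof
```

A large p-value here would be consistent with the gender mix being similar in both groups. On the real data, the same call would take the crosstab of `df_demo_experiment_merged`.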

### **4. Experiment Evaluation**

- **Design Effectiveness**:
    - Was the experiment well-structured? Were clients randomly and equally divided between the old and new designs? Were there any biases?

To evaluate the design effectiveness and ensure that the experiment was well-structured, we need to address the following points:

1. **Random Assignment**: Clients should be randomly assigned to either the Test or Control group to ensure that the groups are comparable at the start of the experiment.
2. **Group Sizes**: The groups should be roughly equal in size or at least have enough samples to achieve sufficient statistical power. A substantial imbalance might raise concerns about the random assignment process.
3. **Baseline Comparability**: We should check if the two groups were comparable in terms of key metrics (like age, gender, tenure, etc.) before the intervention. Any significant differences could indicate biases in group assignment.
4. **Other Biases**: We need to identify any other potential biases that could affect the experiment's outcome, such as the time of year the experiment was conducted, external events, etc.

Let's start by examining the group sizes and then check the baseline comparability between the Test and Control groups in terms of age, gender, and tenure.


In [None]:
merged_data = df_demo_experiment_merged.copy()

In [None]:
# Checking the distribution of clients in Test and Control groups
group_distribution = merged_data['Variation'].value_counts()

# Checking baseline comparability in terms of age, gender, and tenure
mean_age_test = merged_data[merged_data['Variation'] == 'Test']['clnt_age'].mean()
mean_age_control = merged_data[merged_data['Variation'] == 'Control']['clnt_age'].mean()

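# Share of male clients per group, computed directly from merged_data
# (assumes male clients are coded as 'M' in the gendr column)
male_prop_test = (merged_data.loc[merged_data['Variation'] == 'Test', 'gendr'] == 'M').mean()
male_prop_control = (merged_data.loc[merged_data['Variation'] == 'Control', 'gendr'] == 'M').mean()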

mean_tenure_test = merged_data[merged_data['Variation'] == 'Test']['clnt_tenure_yr'].mean()
mean_tenure_control = merged_data[merged_data['Variation'] == 'Control']['clnt_tenure_yr'].mean()

group_distribution, (mean_age_test, mean_age_control), (male_prop_test, male_prop_control), (mean_tenure_test, mean_tenure_control)


Here are the findings regarding the design effectiveness:

1. **Group Sizes**:
   - Test group: 26,968 clients
   - Control group: 23,532 clients

   The groups are not of equal size, but they are sufficiently large, and the imbalance is not substantial.

2. **Baseline Comparability**:
   - **Age**:
     - Test group: Average age is 47.16 years
     - Control group: Average age is 47.50 years
     
   The average ages are quite close between the two groups.
   
   - **Gender**:
     - Test group: 33.29% are male
     - Control group: 33.87% are male
     
   The proportions of male clients are also close between the two groups.
   
   - **Tenure**:
     - Test group: Average tenure is 11.98 years
     - Control group: Average tenure is 12.09 years
     
   The average tenures are quite close between the two groups.

3. **Other Biases**:
   - We have not specifically checked for other biases like the time of year the experiment was conducted, external events, etc. Further information or additional data would be required to evaluate these potential biases.

In conclusion, based on the information we have, the experiment appears to be well-structured. The groups are comparable in terms of key metrics at the beginning of the experiment, indicating that clients were likely randomly assigned. The slight size difference between the Test and Control groups is not a major concern given the large sample sizes. However, to completely rule out any biases, we would need more context or additional data on other potential factors that could influence the results.

- **Duration Assessment**:
    - Was the timeframe of the experiment adequate to gather meaningful data and insights?

To assess whether the timeframe of the experiment was adequate, we'll consider the following factors:

1. **Duration of Experiment**: We should first determine the exact duration of the experiment in terms of days.
2. **Volume of Data**: A longer experiment duration might be necessary if there is low client activity, and a shorter duration might suffice if there's high activity.
3. **Variability Over Time**: If there are weekly or monthly patterns or any other seasonality in client engagement, a longer duration would capture those patterns.
4. **External Factors**: Were there any external events or factors that could have influenced the results during this period? For instance, holidays, marketing campaigns, or other significant events could impact user behavior.
5. **Stability of Metrics**: If the metrics being observed stabilize quickly and remain consistent, a shorter experiment might be sufficient. Conversely, if they fluctuate significantly, a longer timeframe might be necessary.

Let's start by determining the exact duration of the experiment and analyzing the volume of data during this period. We'll then assess the variability over time by looking at the web data.




In [None]:
# Convert the date_time column to datetime format
df_web_data['date_time'] = pd.to_datetime(df_web_data['date_time'])

# Determine the exact duration of the experiment
start_date = df_web_data['date_time'].min()
end_date = df_web_data['date_time'].max()
duration = (end_date - start_date).days

# Analyze the volume of data (number of engagements) for each day of the experiment
daily_engagements = df_web_data.groupby(df_web_data['date_time'].dt.date).size()

duration, daily_engagements


Here's the assessment based on the data:

1. **Duration of Experiment**: 
   - The experiment ran for 97 days, from 2017-03-15 to 2017-06-20.

2. **Volume of Data**: 
   - The daily engagements (number of interactions) range from a few hundred to several thousand per day. This indicates a high volume of data, which is good for capturing meaningful insights.

3. **Variability Over Time**: 
   - The volume of data varies from day to day, and there's no obvious pattern at a glance. However, it's worth noting that the engagements increased significantly towards the end of the experiment period. This could be influenced by factors like marketing campaigns, seasonality, or external events.

To further assess the duration's adequacy, we can visualize the daily engagements to see if there are any clear patterns or anomalies. This will help determine if the experiment captured any weekly or monthly trends and if the timeframe was sufficient.


In [None]:
import matplotlib.pyplot as plt

# Plotting daily engagements over time
plt.figure(figsize=(15, 7))
daily_engagements.plot()
plt.title('Daily Engagements Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Engagements')
plt.grid(True)
plt.tight_layout()
plt.show()


The visualization presents the daily engagements over the course of the experiment. 

Here are some observations:

1. **Trends**: There's a noticeable increase in engagements, with a subsequent decline followed by another surge. This suggests some variability and potentially external influences affecting client engagement.
2. **Weekly Patterns**: There seems to be some periodicity in the data, possibly indicating weekly patterns. This periodicity is a good sign as it suggests that the duration of the experiment was long enough to capture these patterns multiple times.

In conclusion:

- The experiment's duration of 97 days seems adequate to capture a variety of patterns and trends.
- The experiment spanned multiple weeks, which is beneficial for capturing any weekly patterns.

It might be beneficial to investigate the reasons behind the significant spikes in engagements and ensure they aren't introducing any biases to the experimental results. Overall, the timeframe appears sufficient to gather meaningful insights, but understanding the context behind significant changes in engagement is crucial.
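To make the weekly-pattern observation more concrete, one option is to average the daily engagement counts by day of the week. The sketch below uses a made-up event log (the `toy_web` frame and its volumes are assumptions for illustration); on the real data the same grouping would be applied to `df_web_data['date_time']`.

```python
import numpy as np
import pandas as pd

# Hypothetical event log: one row per interaction, with a 'date_time' column
rng = np.random.default_rng(0)
days = pd.date_range('2017-03-15', '2017-06-20', freq='D')
# Made-up daily volumes: fewer events on weekends (dayofweek 5 and 6)
events_per_day = np.where(days.dayofweek >= 5, 50, 100) + rng.integers(0, 10, len(days))
toy_web = pd.DataFrame({'date_time': days.repeat(events_per_day)})

# Daily engagement counts, then the average count for each day of the week
daily = toy_web.groupby(toy_web['date_time'].dt.date).size()
daily.index = pd.to_datetime(daily.index)
weekday_mean = daily.groupby(daily.index.day_name()).mean()

weekday_mean
```

If some weekdays show consistently higher or lower averages, that supports a weekly cycle, and an experiment spanning many full weeks covers that cycle several times.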

- **Additional Data Needs**:
    - What other data, if available, could enhance the analysis?

Having additional data can provide a more comprehensive view of the experiment and aid in drawing more accurate and insightful conclusions. Here are some types of data that could enhance the analysis:

1. **User Feedback**: Direct feedback from clients about their experience with the new vs. old design can provide qualitative insights into the quantitative findings.
  
2. **Engagement Duration**: The amount of time clients spend on the platform can provide insights into their level of engagement and satisfaction.
  
3. **Session Details**: Information about the specific actions users take during each session (e.g., pages visited, features used) can highlight which aspects of the design are most and least effective.
  
4. **Conversion Metrics**: If there are specific actions or outcomes the design aims to promote (e.g., product purchases, sign-ups), tracking these conversion metrics can be invaluable.
  
5. **Device and Browser Information**: Understanding which devices (mobile vs. desktop) or browsers clients are using can reveal if there are design inconsistencies or issues specific to certain technologies.
  
6. **External Factors**: Data on external marketing campaigns, promotions, or events that occurred during the experiment period can help explain spikes or drops in engagement.
  
7. **Demographic Segmentation**: More detailed demographic or behavioral data (e.g., occupation, education level, frequency of use) can help in segmenting the analysis and understanding how different client groups react to the designs.
  
8. **Error Logs**: If there were technical issues or bugs with either design, error logs can provide insights into how often these issues occurred and their impact.
  
9. **Exit Surveys**: Surveys given to clients who decide to stop using the service during the experiment period can provide insights into potential issues with either design.
  
10. **Historical Data**: Data from before the experiment started can set a baseline and help understand if observed changes are truly due to the experiment or are part of a larger trend.

While not all additional data types may be relevant or feasible for every experiment, considering these can lead to a more holistic understanding of the experiment's results and the factors driving those results.

# Bonus: More on Client Behavior Analysis


## Interaction Patterns
    - How do clients navigate through the old versus the new digital process? Do they follow similar steps or diverge at certain points?

To analyze how clients navigate the old vs. new digital process, we'll need to:

- Join the **df_web_data** with **df_experiment_clients** to determine which process each navigation step belongs to.
- Analyze the frequency and order of the process steps for both "Control" and "Test" groups.

Let's start by merging the datasets.

In [None]:
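# Merge web data with the experiment assignment on client_id
# (assumed inner join; skip this line if df_web_experiment_merged already exists from an earlier cell)
df_web_experiment_merged = pd.merge(df_web_data, df_experiment_clients, on='client_id', how='inner')

# Displaying the first few rows of the merged dataframe
df_web_experiment_merged.head()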


The datasets have been successfully merged. The `Variation` column indicates whether a particular navigation step is from the old process ("Control") or the new one ("Test").

To understand how clients navigate through the old and new digital processes, we'll:

1. Group by the `Variation` and `process_step` columns to determine the frequency of each step for both the old and new processes.
2. Analyze the order in which the steps are taken for each process.

Let's start by analyzing the frequency of each step for both processes.

In [None]:
# Grouping by Variation and process_step to get the frequency of each step
step_frequency = df_web_experiment_merged.groupby(['Variation', 'process_step']).size().reset_index(name='frequency')

# Sorting the frequencies for better visualization
step_frequency = step_frequency.sort_values(by=['Variation', 'frequency'], ascending=[True, False])

step_frequency

In [None]:
# Another way of seeing this result, in a pivot table

# Group by Variation and process_step to get the frequency of each step for both processes
step_frequency = df_web_experiment_merged.groupby(['Variation', 'process_step']).size().reset_index(name='count')

step_frequency.pivot(index='process_step', columns='Variation', values='count').fillna(0)


From the frequency analysis, we observe the following:

- For both the old (Control) and new (Test) digital processes, the most frequent step is "start", with more "start" events recorded in the Test group than in the Control group.
- The sequence generally seems to be "start" -> "step_1" -> "step_2" -> "step_3" -> "confirm", based on the decreasing counts.
- For every step in the process, there are more occurrences in the Test group than in the Control group. This may suggest that the new digital process is more engaging or user-friendly, but the Test group also has more clients (see the group sizes in the Experiment Evaluation section), so raw counts alone cannot separate the two explanations.
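Because raw step counts depend on how many clients each group contains, a fairer comparison is the share of each group's clients who reached each step. The following is a minimal sketch on a made-up log; the `toy` frame is hypothetical and only borrows the column names `client_id`, `Variation`, and `process_step` from the merged data.

```python
import pandas as pd

# Hypothetical toy log standing in for df_web_experiment_merged
toy = pd.DataFrame({
    'client_id':    [1, 1, 1, 2, 2, 3, 4, 4, 4, 5],
    'Variation':    ['Test', 'Test', 'Test', 'Test', 'Test', 'Test',
                     'Control', 'Control', 'Control', 'Control'],
    'process_step': ['start', 'step_1', 'confirm', 'start', 'step_1', 'start',
                     'start', 'step_1', 'confirm', 'start'],
})

# Number of distinct clients in each group
clients_per_group = toy.groupby('Variation')['client_id'].nunique()

# Number of distinct clients who reached each step, per group
reached = (toy.groupby(['Variation', 'process_step'])['client_id']
              .nunique()
              .unstack('Variation'))

# Share of each group's clients who reached each step
reach_rate = reached.div(clients_per_group, axis='columns')

reach_rate
```

Counting distinct clients rather than events also avoids inflating a step's frequency when the same client repeats it several times.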

## Difference in the number of actions (steps) taken 

**Objective**:

We aim to investigate if there's a difference in the number of actions (steps) taken by users between the Test and Control groups.


**Hypothesis:**
1. **Null Hypothesis $(H_0$)**: The average number of actions taken by users for the Test group (new design) is equal to the average number of actions taken by users for the Control group (old design).
2. **Alternative Hypothesis $(H_a$)**: The average number of actions taken by users for the Test group (new design) is different from the average number of actions taken by users for the Control group (old design).


**Data Preparation**

To test this hypothesis, we'll need to:


1. Aggregate the number of actions (steps) taken by each user in both the Test and Control groups.
2. Compare the means to understand the data.
3. Use a two-sample t-test to compare the means of the two groups.

Let's begin by aggregating the number of actions taken by each user in the Test and Control groups.

In [None]:
# Filter the relevant columns: client_id, Variation, process_step
data = df_web_experiment_merged[['client_id', 'Variation', 'process_step']]

# Group by client_id and Variation and count the number of process steps (actions).
# Stored under its own name so it does not overwrite the earlier grouped_data table
# of completion rates, which the power section reuses.
actions_per_client = data.groupby(['client_id', 'Variation']).count().reset_index()

# Split the data into control and test groups
control_group = actions_per_client[actions_per_client['Variation'] == 'Control']
test_group = actions_per_client[actions_per_client['Variation'] == 'Test']

control_group.head(), test_group.head()

Before conducting the hypothesis test, let's get an initial sense of the data by comparing the mean number of actions (process steps) in the Control group with that of the Test group. This will give us an idea about which group tends to take more actions on average.

In [None]:
# Calculate the mean number of actions for both groups
mean_control = control_group['process_step'].mean()
mean_test = test_group['process_step'].mean()

mean_control, mean_test


From this initial comparison, we observe that users in the Test group (with the new design) tend to take more actions on average compared to users in the Control group (with the old design).

To determine if there's a significant difference in the average number of actions taken by users between the Test and Control groups, we'll perform an independent two-sample t-test. This test will compare the means of the two groups to check if they are statistically different from each other.

Let's conduct the t-test:

In [None]:
from scipy.stats import ttest_ind

# Conduct an independent t-test
t_stat, p_value = ttest_ind(control_group['process_step'], test_group['process_step'])

t_stat, p_value


Given the very small p-value (much less than the typical significance level of 0.05), we can reject the null hypothesis. This suggests that there is a statistically significant difference in the average number of actions taken by users between the Test and Control groups.

**Conclusion**:

The data provides strong evidence for the alternative hypothesis: the average number of actions taken by users in the Test group (new design) differs from that in the Control group (old design). As the means show, users in the Test group tend to take more actions on average.
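Action counts per user are often skewed and the two groups may have different variances, so it can be worth checking whether the conclusion holds under tests with weaker assumptions. The sketch below runs Welch's t-test and a Mann-Whitney U test on simulated counts; the Poisson samples are made up for illustration and are not the Vanguard data.

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

# Made-up action counts per user, not the Vanguard data
rng = np.random.default_rng(42)
control_actions = rng.poisson(lam=5.0, size=500)
test_actions = rng.poisson(lam=6.0, size=600)

# Welch's t-test: does not assume equal variances in the two groups
t_welch, p_welch = ttest_ind(control_actions, test_actions, equal_var=False)

# Mann-Whitney U test: rank-based, does not assume normality
u_stat, p_mw = mannwhitneyu(control_actions, test_actions, alternative='two-sided')

(t_welch, p_welch), (u_stat, p_mw)
```

On the real data, `control_group['process_step']` and `test_group['process_step']` would take the place of the simulated arrays. If all three tests agree, the result is less likely to be an artifact of one test's assumptions.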

# Bonus - power and effect size

## Power


**Definition**: Power is the probability that a statistical test correctly rejects the null hypothesis when it is indeed false. In simpler terms, it's the ability of a test to detect an effect if there truly is one.
- It's the complement of the Type II error rate ($\beta$): power $= 1 - \beta$. A common target for power is $1 - 0.2 = 0.8$.

**Importance**: 
- Imagine you're trying to find out if a new medicine works better than an old one. If the test has low power, you might conclude that the new medicine doesn't make a difference, when in reality, it does.
- High power reduces the risk of Type II errors (failing to detect an effect when one exists).

**Designing an Experiment vs. Interpreting Results**:
- **Designing**: Before running an experiment, researchers often conduct a "power analysis" to determine the necessary sample size to detect an effect of a certain size with a certain degree of confidence. If power is too low, the researcher may increase the sample size or adjust the design to improve the power.
   - In this context, you usually know or have an estimate of the effect size you care about (based on prior studies, expert judgment, or practical significance), and you want to ensure your study is sufficiently powered to detect this effect.
- **Interpreting**: After the results are in, power (here called post-hoc power) can provide context. It's used especially when an expected effect was not found (i.e., a non-significant result).
   - The primary purpose is to determine if the study was underpowered (i.e., the sample size was too small) to detect an effect of the observed size.
   - This can help differentiate between two interpretations of a non-significant result: 
     1. The true effect is close to zero.
     2. The study was underpowered to detect the true effect.
   - However, post-hoc power analysis has been criticized in the statistical community because it can be redundant. For example, if you have a non-significant result, post-hoc power will inevitably be low. If you have a significant result, post-hoc power will be high.
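To illustrate how power grows with sample size at the design stage, here is a small sketch using the normal approximation for a two-sided, two-sample z-test with equal group sizes. The function name `two_sample_power` and the effect size of 0.2 are illustrative choices, not part of the original analysis.

```python
from statistics import NormalDist

def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with equal group sizes."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Power for a standardized effect size of 0.2 at several sample sizes
powers = {n: round(two_sample_power(0.2, n), 3) for n in (50, 100, 200, 400, 800)}

powers
```

For a fixed effect size and significance level, power rises steadily with the number of users per group, which is why a power analysis is usually framed as a question about sample size.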


## **Effect Size**

**Definition**: Effect size quantifies the size of the difference between groups or the strength of a relationship between variables. It provides a measure that is free from sample size, allowing a comparison of results across different studies or experiments.
- Common measures are Cohen's $d$ and Cohen's $h$. Cohen's $d$ compares two means, and Cohen's $h$ compares two proportions.
- When the effect size is unknown, Cohen's rule-of-thumb benchmarks are commonly used: Small = 0.2, Medium = 0.5, Large = 0.8.
- In practice, the target effect size should be set from the experiment's own context: how large an effect of the treatment the company would consider meaningful.
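As a minimal sketch of the two measures, assuming the usual textbook definitions (pooled standard deviation for $d$, arcsine square-root transformation for $h$), both can be computed with the standard library. The helper names `cohens_d` and `cohens_h` and the input values are made up for illustration.

```python
import math
import statistics

def cohens_d(sample_a, sample_b):
    """Cohen's d for two samples, using the pooled standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / pooled_sd

def cohens_h(p_a, p_b):
    """Cohen's h for two proportions (arcsine square-root transformation)."""
    return 2 * (math.asin(math.sqrt(p_a)) - math.asin(math.sqrt(p_b)))

# Made-up inputs for illustration
d = cohens_d([6, 7, 8, 9, 10], [4, 5, 6, 7, 8])
h = cohens_h(0.79, 0.74)

d, h
```

Both values can then be read against the Small / Medium / Large benchmarks, keeping in mind that those benchmarks are only a rule of thumb.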

**Importance**:
- Large-sample studies can produce statistically significant results even for trivial effects, while small-sample studies might fail to reach statistical significance for effects that are still of practical significance. Effect size helps in distinguishing between statistical significance and "real-world" or practical significance.

**Designing an Experiment vs. Interpreting Results**:
- **Designing**: Researchers make an educated guess about the expected effect size based on previous research or pilot studies. This anticipated effect size is then used in power analysis to determine the required sample size.
- **Interpreting**: Once the study is done, the calculated effect size tells us how large the observed effect is. Coupled with statistical significance, it provides a fuller picture of the results. For instance, you might have a significant result, but if the effect size is tiny, it might not be of practical importance.

**In Summary**

While p-values indicate whether the observed data are compatible with no effect, power tells us how likely the test was to detect an effect of a given size, and effect size tells us how big the observed effect is. Power and effect size are complementary tools to p-values and are crucial for both designing robust experiments and interpreting their results in a meaningful context.

## **Interpreting results**

Let's calculate the effect size and post-hoc power for the first test, which compared completion rates. The new design (Test group) had a higher completion rate than the old design (Control group), and we now want to quantify how large that difference is and how well powered the test was to detect it.

1. **Effect Size Calculation**:
   We'll use Cohen's $h$ for the effect size in a two-proportions scenario. The formula for Cohen's $h$ is:
   
   $$ h = 2 \times \left( \arcsin\sqrt{p_1} - \arcsin\sqrt{p_2} \right) $$

   where $p_1$ and $p_2$ are the proportions from the two groups.

2. **Power Calculation** (post-hoc power):
   We'll use the `statsmodels` library, which provides a function to calculate the power of a two-proportions z-test.



With these, we can determine the power of the test, which tells us the probability that we would detect a difference in completion rates given our sample sizes and the observed effect size.


In [None]:
import numpy as np
import statsmodels.stats.api as sms

# grouped_data calculated above has total_users which is n (control and test), and p (control and test)
n_control = grouped_data.total_users[0] # n control
n_test = grouped_data.total_users[1] # n test
p_control = grouped_data.p[0] # p control
p_test = grouped_data.p[1] # p test

# Calculate Cohen's h as effect size
h_effect_size = sms.proportion_effectsize(p_test, p_control)


# Define the alpha level (significance level)
alpha = 0.05

# Calculate the power of the test 
power_calculated = sms.NormalIndPower().solve_power(effect_size=h_effect_size, nobs1=n_test, alpha=alpha, ratio=n_control/n_test, alternative='two-sided')

h_effect_size, power_calculated


In [None]:
grouped_data


1. **Effect Size (Cohen's $h$)**: Approximately 0.628
   - Cohen's $h$ measures the difference between the two proportions after an arcsine square-root transformation. By Cohen's benchmarks, a value of 0.628 falls between a medium and a large effect, indicating a substantial difference between the two groups. This suggests that the difference is not only statistically significant (as indicated by our hypothesis test) but also practically meaningful.
   
2. **Post-hoc Power**: 1.0 (or 100%)
   - This suggests that, with the observed effect size and sample sizes, the test would almost certainly reject the null hypothesis if the true difference in completion rates between the Test and Control groups were as large as the observed one. The reported value is rounded, so the probability is not literally 100%.
   - However, it's worth noting that a post-hoc power of 1.0 reflects the combination of a large observed effect and large samples, which aligns with the results of our hypothesis test and the calculated effect size. It's also a reminder that post-hoc power analyses mirror the results of the hypothesis test: when you have a significant result, post-hoc power will be high.
   - In practice, power analyses are most useful when planning experiments to ensure you have a sufficient sample size to detect an effect of interest. In this case, it confirms that our experiment was well-powered to detect the observed difference. The sample size is more than sufficient to detect the observed difference between the two groups, and the effect size provides context about how large this difference is in practical terms.

## Designing an experiment

Let's assume we are in the phase of designing the experiment for the same hypothesis test.

### Steps for Power Analysis

To calculate the minimum sample size for an experiment, especially for a two-proportion z-test, you'd typically follow these steps:


1. **Define the Baseline Proportion**:
   - This is the expected proportion in the control group, often based on previous data or historical benchmarks. For instance, if historically 74% of users complete the process on your website, that's your baseline proportion.

2. **Determine the Anticipated Effect Size**:
   - Decide on the smallest difference between the control and test groups that you want to be able to detect. This is the difference from the baseline proportion. If you're testing a new website process design and anticipate an improvement of 5 percentage points in conversion rate from a baseline of 74% (that is, from 74% to 79%), then the difference to detect is 0.05.
   - Often referred to as the **Minimum Detectable Effect (MDE)**.

3. **Specify Desired Power and Significance Level**:
   - **Power**: The probability of correctly rejecting the null hypothesis when the alternative hypothesis is true. Commonly set at 0.80, which means there's an 80% chance of detecting the anticipated effect if it truly exists.
   - **Significance Level ($\alpha$)**: The risk of a Type I error, which is rejecting the null hypothesis when it's actually true. It's often set at 0.05, denoting a 5% chance of finding an effect that doesn't truly exist.

4. **Account for the Ratio of Participants**:
   - Specify the ratio of participants between the test and control groups. It's common to have an equal number of participants in both groups (ratio = 1), but sometimes experiments might have imbalanced groups.

5. **Determine the Sample Size**:
   - Using the above parameters, you can calculate the required sample size for each group using statistical methods or software tools.

By following these steps and incorporating the necessary parameters, you can ensure that your experiment is adequately powered to detect the effect size of interest.

### Python Code for Power Analysis

For this example, let's assume:
- You anticipate a difference of 5 percentage points in completion rates between the Test and Control groups.
    p_control_anticipated = 0.74  # 74%
    p_test_anticipated = 0.79  # 79%
- You want to have a power of 0.80.
- You're using a significance level ($\alpha$) of 0.05.
- Ratio of Treatment vs. Control: 
   - This refers to the relative allocation of participants between the two groups. 
   - A ratio of 50/50 (or 1:1) means the Test group and Control group have an equal number of participants.
   - A ratio of 25/75 (or 1:3) means that for every participant in the Test group, there are three in the Control group. The value $k$ often represents this ratio, where $k = \frac{\text{size of Test group}}{\text{size of Control group}}$, so a 25/75 split gives $k = 1/3$.
   - This ratio is about how you allocate participants in your experiment.

Let's determine the required sample size for each group using these parameters.


In [None]:
# Define the anticipated completion rates for Control and Test groups
p_control_anticipated = 0.74  # 74%
p_test_anticipated = 0.79  # 79%

# Calculate the anticipated effect size (Cohen's h); Test first, matching the post-hoc cell above
anticipated_effect_size = sms.proportion_effectsize(p_test_anticipated, p_control_anticipated)

# Specify parameters for power analysis 
alpha = 0.05 
power = 0.8 
# statsmodels defines ratio as nobs2 / nobs1. 1.0 is a 50/50 split;
# a 25/75 split with the larger group as group 1 would be ratio = 1/3.
ratio = 1.0

# Using statsmodels to determine the required sample size for a two-proportion z-test
required_sample_size = sms.NormalIndPower().solve_power(effect_size=anticipated_effect_size, power=power, alpha=alpha, ratio=ratio, alternative='two-sided')

required_sample_size

The minimum sample size required per group to detect an increase of 5 percentage points in completion rate (from a baseline of 74% to 79%) with a significance level of $\alpha = 0.05$ and a power of 0.80 is approximately 1125 users.

This means that to have an 80% chance of detecting that increase in the completion rate, you would need at least 1125 users in each of the Test and Control groups.

Remember, this calculation assumes that the true proportions in the Test and Control groups will be 79% and 74%, respectively. If the true proportions are different, the actual power of the test will differ from the desired power. Adjusting other parameters, like the desired power or the significance level, will also change the required sample size.
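To see how sensitive the required sample size is to these choices, the sketch below recomputes it under a few alternative settings. It uses the normal approximation with Cohen's $h$ and ignores the negligible far tail of the two-sided test, so values can differ slightly from the `statsmodels` result; `required_n1` is a hypothetical helper written for this illustration, with `ratio` defined as (size of group 2) / (size of group 1).

```python
import math
from statistics import NormalDist

def required_n1(p_control, p_test, alpha=0.05, power=0.80, ratio=1.0):
    """Approximate size of group 1 for a two-sided two-proportion z-test.

    ratio is (size of group 2) / (size of group 1).
    """
    h = 2 * (math.asin(math.sqrt(p_test)) - math.asin(math.sqrt(p_control)))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ((z_alpha + z_beta) / h) ** 2 * (1 + ratio) / ratio

base = required_n1(0.74, 0.79)                       # 50/50 split, power 0.80
higher_power = required_n1(0.74, 0.79, power=0.90)   # more power
stricter_alpha = required_n1(0.74, 0.79, alpha=0.01) # stricter significance level
smaller_mde = required_n1(0.74, 0.77)                # 3-point instead of 5-point lift

# Unequal allocation: group 2 three times as large as group 1
n1_unequal = required_n1(0.74, 0.79, ratio=3.0)
total_equal = 2 * base
total_unequal = n1_unequal * (1 + 3.0)

base, higher_power, stricter_alpha, smaller_mde, (total_equal, total_unequal)
```

Asking for more power, a stricter significance level, or a smaller detectable difference all increase the required sample size, and an unequal split needs more users in total than a 50/50 split for the same power.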

### Formula

The formula for the minimum sample size ($n$) per group for a two-proportion z-test is:

$$
n = \left( \frac{z_{\alpha/2} + z_{\beta}}{p_{\text{MDE}}} \right)^2 \times \left( p_{\text{baseline}}(1-p_{\text{baseline}}) + p_{\text{alt}}(1-p_{\text{alt}}) \right)
$$

Where:
- $p_{\text{MDE}}$ is the Minimum Detectable Effect.
- $p_{\text{alt}}$ is the proportion in the test group, which is $p_{\text{baseline}} + p_{\text{MDE}}$.
- $z_{\alpha/2}$ is the z-value associated with a two-tailed test of significance level $\alpha$. For $\alpha = 0.05$, $z_{\alpha/2}$ is approximately 1.96.
- $z_{\beta}$ is the z-value associated with the desired power. For a power of 0.80, $z_{\beta}$ is approximately 0.84.

Let's calculate the minimum sample size required given a hypothetical baseline proportion and MDE. For demonstration purposes, let's assume:
- $p_{\text{baseline}} = 0.74$ (the observed completion rate for the Control group).
- We want to detect an increase of 5 percentage points in completion rate (so $p_{\text{MDE}} = 0.05$).
- $\alpha = 0.05$ and desired power is 0.80.

If, instead of using the Python functions above, you do the raw calculation with this formula, you'll get approximately the same result (about 1125 users per group).
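The raw calculation can be sketched with the standard library alone, plugging the assumed values into the formula above. The variable names are chosen here for readability and are not from the original notebook.

```python
import math
from statistics import NormalDist

# Inputs taken from the assumptions listed above
p_baseline = 0.74
p_mde = 0.05
p_alt = p_baseline + p_mde
alpha = 0.05
power = 0.80

# z-values for the two-tailed significance level and the desired power
z_alpha_2 = NormalDist().inv_cdf(1 - alpha / 2)
z_beta = NormalDist().inv_cdf(power)

# Sum of the binomial variances of the two groups
variance_sum = p_baseline * (1 - p_baseline) + p_alt * (1 - p_alt)

# Minimum sample size per group according to the formula
n_per_group = ((z_alpha_2 + z_beta) / p_mde) ** 2 * variance_sum

math.ceil(n_per_group)
```

The small gap between this value and the `statsmodels` output comes from the different approximations: the formula works with the raw difference in proportions, while `proportion_effectsize` uses the arcsine-transformed effect size.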