# **Cleaning Data Conclusions**

## Day 2
- `merged_final_demo_final_experiment_clients_df`

# Final Demo + Final Experiment Clients (Merged DataFrame)

### Table Overview

**Columns:**

- `client_id (int)`: A unique identifier for each client, used to distinguish one client from another in the dataset.
- `client_tenure_years (int)`: The number of years a client has been associated with the company. For example, a client with client_tenure_years = 6 has been with the company for 6 years.
- `client_tenure_months (int)`: The number of months a client has been associated with the company. This value is often more granular than client_tenure_years and could be used for more detailed analysis. For instance, a tenure of 6 years and 1 month would be represented as 73 months.
- `client_age (int)`: The age of the client in years.
- `gender`: The gender of the client. The value can be "Male", "Female", or "Unspecified"; "Unspecified" means the gender was not recorded.
- `num_accounts (int)`: The number of accounts the client has with the company.
- `balance (float)`: The total balance of the client's accounts with the company. This is a monetary value, and the balance can indicate how much money the client holds across their accounts.
- `calls_last_6_months (int)`: The number of calls the client has made to the company in the past six months. This can give an idea of how actively the client has engaged with the company.
- `logons_last_6_months (int)`: The number of times the client has logged into their account or interacted with the company online in the past six months.
- `variation (object)`: This column likely indicates whether the client is part of a control group or a test group for an experiment. In this case, clients are either labeled as "Test", "Control", or "Unknown".

## Day 1 & 2 (Week 5)

### **Client behavior analysis**

Answer the following questions about demographics:

- Who are the primary clients using this online process?
- Are the primary clients younger or older, new or long-standing?
- Next, carry out a client behaviour analysis to answer any additional relevant questions you think are important.

### **Active clients (higher logons and accounts):**

- Average client age: 46.18 years
- Average client tenure: 12.05 years

### **Conclusion:**
- The primary clients are generally older (above 40 years old).
- The primary clients are generally long-standing (over 3 years).

![Image Description](../../visualizations/barplot_client_age_distribution.png)

![Image Description](../../visualizations/barplot_client_tenure_distribution.png)

## Day 3 (Week 5)

### **Performance Metrics**

**Success Indicators**

Determine which key performance indicators (KPIs) will measure the success of the new design.
Use at least completion rate, time spent on each step, and error rates. Add any other KPIs you find relevant.

## **KPIs - Completion Rate**
The proportion of users who reach the final `confirm` step.

**How**

- We analyze the dataset to calculate the completion rate from the steps each client takes. We first convert the `step` names to numerical values for easier calculation, then filter the rows where `step` equals 4 (the final step, i.e. completion of the process). The resulting `clients_finished` dataframe contains all records of clients who completed the process.

- Using `.nunique()`, we count the distinct `client_id` values in the dataframe to get the total number of unique clients. We apply the same call to `clients_finished` to get the number of unique clients who finished the process.

- Grouping `clients_finished` by `client_id`, we count how many times each client completed the process. The results are stored in a new dataframe called `completion_count_df`, where `completion_count` is the number of times each client reached step 4.

- Any `NaN` values in the `completion_count` column will be replaced by 0, indicating that these clients never completed the process.

- These new columns and data will be added to the dataframe.
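
The steps above can be sketched roughly as follows. This is a minimal illustration on made-up toy data, not the project's actual code: the column names come from the text, while the values and the use of `merge` to attach the counts are assumptions.

```python
import pandas as pd

# Toy stand-in for the step-level data (values are invented for illustration).
df = pd.DataFrame({
    "client_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "step":      [0, 1, 4, 0, 1, 0, 4, 0, 4],
})

# Rows where a client reached the final step (step 4).
clients_finished = df[df["step"] == 4]

# Unique clients overall and unique clients who finished.
total_clients = df["client_id"].nunique()
finished_clients = clients_finished["client_id"].nunique()

# How many times each client reached step 4.
completion_count_df = (
    clients_finished.groupby("client_id")
    .size()
    .reset_index(name="completion_count")
)

# Attach the counts to every row; clients without a completion get 0.
df = df.merge(completion_count_df, on="client_id", how="left")
df["completion_count"] = df["completion_count"].fillna(0).astype(int)

completion_rate = finished_clients / total_clients * 100
print(f"Clients who finished the process: {finished_clients} out of {total_clients}")
print(f"Completion rate: {completion_rate:.1f}%")
```

Counting unique clients (rather than rows) keeps a client who completes the process several times from being counted more than once in the rate.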

### **Results**

The result *"Clients who finished the process: 47,787 out of 70,594"* provides a summary of how many clients in the dataset have successfully completed the process (reached the final step, which is step 4 in this case), compared to the total number of clients.

#### **Breaking Down the Information:**

- **Clients who finished the process (47,787):**
This number represents the clients who reached the final step (step 4) in the project or process. It's a count of how many clients successfully completed the entire journey or task.

- **Total unique clients (70,594):**
This is the total number of unique clients in the dataset. The client_id column likely represents individual clients, and nunique() is used to count how many distinct clients are in the dataset, regardless of how many steps they completed.

#### **What Can We Infer?**

The ratio of clients who finished the process to the total number of clients can be calculated as:

$$\text{Completion Rate} = \frac{47{,}787}{70{,}594} \times 100 \approx 67.7\%$$

So, about 67.7% of the clients who started the process completed it.

- **Client Drop-off:**
The remaining 32.3% of clients (70,594 - 47,787 = 22,807) did not reach the final step. This could indicate a drop-off or abandonment rate, where clients started but didn’t finish.
Understanding the reasons for this drop-off (e.g., user experience issues, complexity, lack of incentives) could help improve the process or identify areas for intervention to increase completion rates.

- **Implications for Project Metrics:**
A 67.7% completion rate can be seen as relatively good in many contexts, especially if the process is long or involves several steps. However, in some industries or projects, you may want to aim for a higher completion rate.
If this is a metric for performance or KPIs (Key Performance Indicators), you might want to set a target (e.g., 80% completion rate) and use this data to track progress toward that goal.

#### **Potential Follow-up Analysis:**

- **Examine the Drop-off Points:**
You might want to analyze where the drop-offs occur in the steps leading up to step 4. For instance, do most clients drop off at a specific step (e.g., step 3), or is the drop-off more evenly distributed?
Investigating this can give insight into potential bottlenecks or friction points in the process.
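
One possible way to locate drop-off points is to look at the furthest step each client reached. The sketch below uses invented toy data and assumes steps are numbered 0 to 4; it is meant as a starting point rather than the analysis actually performed.

```python
import pandas as pd

# Toy data (invented): one row per step event.
df = pd.DataFrame({
    "client_id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "step":      [0, 1, 2, 0, 1, 0, 1, 2, 3, 4],
})

# Furthest step each client reached.
furthest_step = df.groupby("client_id")["step"].max()

# Number of clients whose journey ended at each step (step 4 = completed).
stopped_at = furthest_step.value_counts().reindex(range(5), fill_value=0)

# Share of clients who reached at least each step (a simple funnel).
total_clients = furthest_step.size
reached_at_least = stopped_at.iloc[::-1].cumsum().iloc[::-1] / total_clients

print(stopped_at)
print(reached_at_least)
```

A sharp fall in `reached_at_least` between two consecutive steps would point to the step where most clients abandon the process.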

- **Segment Clients:**
Segmenting clients by other characteristics (e.g., demographics, usage patterns, source of acquisition) could reveal if some groups are more likely to complete the process than others. This could help tailor interventions for high-value or at-risk clients.

- **Multiple Completions:**
If some clients complete the process multiple times (as might be possible in certain scenarios), this could skew the results. You could consider focusing on unique clients who completed the process at least once versus total completions.

- **Client Retention:**
If the process completion rate is tied to retention or customer loyalty (e.g., users who finish the process are more likely to stay), then this result could be a strong indicator of overall client engagement.

#### **Conclusion:**

The result shows that **approximately 67.7% of clients successfully completed the process, while about 32.3% did not**. This completion rate could be a useful KPI for understanding how effectively the project is engaging and retaining clients, as well as identifying areas for improvement in the process.


![Image Description](../../visualizations/kpi_plots/barchart_total_vs_completed_clients.png)


![Image Description](../../visualizations/kpi_plots/piechart_completion_rate_clients.png)


## **KPIs - Time Spent on Each Step**
The average duration users spend on each step.

**How:**

Our code calculates the average time spent by clients on each step of a process during each visit, using the timestamp data for each step. The key idea is to calculate the time difference between consecutive steps for each client and then group this data by `client_id` and `step` to compute average times. The final results are presented in both seconds and minutes.

- We ensured that the `date_time` column is in a proper datetime format so it can be computed.
- The dataframe is sorted by `client_id`, `visit_id`, and `step` to ensure chronological order. This is mandatory for calculating the time differences correctly.
- We created a `time_spent` column by subtracting the current step's `date_time` from `next_step_time`, which gives the duration spent on the current step.
- `shift(-1)` is used to shift the `date_time` values within each group, effectively getting the timestamp of the next step in sequence.
- For step 4, there is no "next step" because it's the final step in the process. The code sets `next_step_time` to the maximum `date_time` for each visit (`visit_id`), assuming this represents the end of the session.
- After updating the `next_step_time` for step 4, the `time_spent` for step 4 is recalculated.
- The data is grouped by `client_id` and `step`, and the mean time spent on each step is calculated for each client. This gives the average time spent per step for each client. The average time spent will also be calculated in minutes and seconds for further reading.


The final dataframe, `avg_time_per_step`, contains the following columns:
- `client_id`: The unique client identifier.
- `step`: The step number (0 to 4, where 4 is the final step).
- `avg_time_spent`: The average time spent on each step (as a timedelta object).
- `avg_time_seconds`: The average time spent in seconds.
- `avg_time_minutes`: The average time spent in minutes.


The function returns a dataframe `avg_time_per_step`, which contains the average time spent by each client on each step of the process. This is useful for understanding how much time clients typically spend at each stage of the process, which can help identify bottlenecks or stages that need improvement.
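
A minimal sketch of this calculation is shown below. It follows the steps described above on invented toy data; the exact grouping keys and the choice to average seconds before converting back to a timedelta are assumptions, not necessarily what the project code does.

```python
import pandas as pd

# Toy data (invented): one client, a full visit "a" and a partial visit "b".
df = pd.DataFrame({
    "client_id": [1, 1, 1, 1, 1, 1, 1],
    "visit_id":  ["a", "a", "a", "a", "a", "b", "b"],
    "step":      [0, 1, 2, 3, 4, 0, 1],
    "date_time": [
        "2017-04-01 10:00:00", "2017-04-01 10:00:30", "2017-04-01 10:01:30",
        "2017-04-01 10:03:30", "2017-04-01 10:04:00",
        "2017-04-02 12:00:00", "2017-04-02 12:00:50",
    ],
})

df["date_time"] = pd.to_datetime(df["date_time"])
df = df.sort_values(["client_id", "visit_id", "step"])

# Timestamp of the next step within the same visit (NaT for the last row).
df["next_step_time"] = df.groupby(["client_id", "visit_id"])["date_time"].shift(-1)

# Step 4 has no next step: fall back to the last timestamp of the visit.
visit_end = df.groupby("visit_id")["date_time"].transform("max")
is_final = df["step"] == 4
df.loc[is_final, "next_step_time"] = visit_end[is_final]

# Duration of each step in seconds (NaN where no next timestamp exists).
df["time_spent_seconds"] = (df["next_step_time"] - df["date_time"]).dt.total_seconds()

avg_time_per_step = (
    df.groupby(["client_id", "step"])["time_spent_seconds"]
    .mean()
    .reset_index(name="avg_time_seconds")
)
avg_time_per_step["avg_time_minutes"] = avg_time_per_step["avg_time_seconds"] / 60
avg_time_per_step["avg_time_spent"] = pd.to_timedelta(
    avg_time_per_step["avg_time_seconds"], unit="s"
)
print(avg_time_per_step)
```

Note that with this fallback, step 4 gets a duration of zero whenever it is the last event of the visit, which is one reason the time reported for the final step should be read with care.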

### **Results**

| Step      | Average time (minutes) |
|-----------|--------------------------|
| Start     | 0.673287                 |
| Step 1    | 0.813043                 |
| Step 2    | 1.563080                 |
| Step 3    | 2.144761                 |
| Finish    | 0.320263                 |

#### **Conclusion:**

The average time analysis indicates that Steps 2 and 3 take the longest (about 1.56 and 2.14 minutes on average), which suggests they require more client attention or effort, while the final step (Step 4, "Finish") takes the least time (about 0.32 minutes). Being the final step likely plays a part in this, but we don't know exactly what happens in that step, so understanding why clients spend less time there could reveal areas for improvement in user experience or completion rates. Optimizing Steps 2 and 3 might also enhance client engagement and process efficiency.


Note that how a step is defined for the purpose of this calculation is a subjective decision.

![Image Description](../../visualizations/kpi_plots/barplot_avg_time_per_step.png)

![Image Description](../../visualizations/kpi_plots/lineplot_avg_time_per_step.png)


## **KPIs - Error Rates**
If there’s a step where users go back to a previous step, it may indicate confusion or an error. You should consider moving from a later step to an earlier one as an error.

**How**

`calculate_error_count(df)` is a function that processes a dataframe (`df`) containing client visit data and calculates several error-related metrics. Specifically, it tracks instances where a client "steps back" in the process (i.e., moves from a higher step number to a lower one within the same visit). The function calculates the total number of errors and the error rates, showing how often clients make these backward steps during the process.
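
The body of `calculate_error_count` is not reproduced here. As an illustration of the idea only, the hypothetical helper `count_backward_steps` below flags backward moves on invented toy data; its name, return keys, and sorting by `date_time` are all assumptions.

```python
import pandas as pd

def count_backward_steps(df: pd.DataFrame) -> dict:
    """Hypothetical helper: flag moves to a lower step number within a visit."""
    df = df.sort_values(["client_id", "visit_id", "date_time"]).copy()
    # Previous step within the same visit (NaN for the first event).
    prev_step = df.groupby(["client_id", "visit_id"])["step"].shift(1)
    # Comparing against NaN yields False, so first events are never errors.
    df["is_error"] = df["step"] < prev_step
    clients_with_errors = df.loc[df["is_error"], "client_id"].nunique()
    return {
        "total_errors": int(df["is_error"].sum()),
        "client_error_rate": clients_with_errors / df["client_id"].nunique() * 100,
        "step_error_rate": df["is_error"].mean() * 100,
    }

# Toy data (invented): client 1 goes back from step 1 to step 0 once.
toy = pd.DataFrame({
    "client_id": [1] * 5 + [2] * 3,
    "visit_id":  ["a"] * 5 + ["b"] * 3,
    "step":      [0, 1, 0, 1, 2, 0, 1, 2],
    "date_time": pd.date_range("2017-04-01 10:00", periods=8, freq="min"),
})

result = count_backward_steps(toy)
print(result)
```

Ordering by timestamp rather than by step number matters here, because sorting by `step` would hide exactly the backward moves being counted.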

### **Results**

**Error rate (clients with errors): 29.95%**

29.95% of clients have at least one backward step during their visit. This means that nearly 30% of the clients in our dataset encountered some form of issue, where they moved backward in the process.
This is a relatively high error rate, suggesting that a significant portion of clients are experiencing difficulties during their journey, which may require further investigation to improve the process.

**Error rate (steps with errors): 8.13%**

8.13% of all steps across all visits are backward steps (errors). This shows that a small but noticeable percentage of the steps in the process are errors, meaning that clients are not always following the expected forward progression.
While this percentage is not overwhelmingly high, it still represents a non-negligible proportion of the entire process, and further investigation could help optimize the flow.

**Error rate per step: 8.13%**

This is identical to the "steps with errors" rate because each step is either an error (backward movement) or not. Essentially, this shows that 8.13% of all individual steps across all visits are backward steps.
This rate gives an overall picture of how often clients make errors during the entire process, not just at specific stages.

![Image Description](../../visualizations/kpi_plots/piechart_error_clients.png)


![Image Description](../../visualizations/kpi_plots/lineplot_error_rate_per_step.png)


## Day 4 (Week 5)

Confirm whether the difference in completion rate between the new design and the old design is statistically significant.
Given the data and KPIs you have explored, one interesting hypothesis to test concerns the completion rate of the Test and Control groups. Since the new design (Test group) had a higher completion rate than the old design (Control group), you are required to confirm whether this difference is statistically significant.



### **Results**

#### **Control Variation**

Completion Rate: Clients who finished the process: 65.6% (15428 out of 23526).

Total number of errors across all clients: 9576

Median time spent across all steps: 0.88 minutes

| Step | Average Time (Minutes) |
|------|------------------------|
| 0    | 0.00                   |
| 1    | 0.40                   |
| 2    | 1.15                   |
| 3    | 1.70                   |
| 4    | 0.92                   |


#### **Test Variation**

Completion Rate: Clients who finished the process: 69.3% (18682 out of 26961).

Total number of errors across all clients: 16229

Median time spent across all steps: 1.18 minutes

| Step | Average Time (Minutes) |
|------|------------------------|
| 0    | 0.00                   |
| 1    | 0.40                   |
| 2    | 1.15                   |
| 3    | 1.70                   |
| 4    | 0.92                   |

![Image Description](../../visualizations/kpi_plots/barplot_completion_rate_variations.png)


![Image Description](../../visualizations/kpi_plots/piechart_completion_rate_variations.png)

![Image Description](../../visualizations/kpi_plots/barplot_error_count_per_variation.png)


## Day 5 (Week 5)

## **A/B Test**

When analyzing the results of an A/B test (where you compare two versions of a product or design), you're required to ensure that the new design (Test) improves the completion rate by at least 5% compared to the current version (Control).

We performed an analysis to compare the completion rates between a Test Group and a Control Group using a Z-test for proportions.

### 1. Data Cleaning:

The code first removes duplicate rows from both the `test_df` and `control_df` dataframes based on the combination of `client_id` and `visitor_id`. This ensures that each `client_id` and `visitor_id` pair appears only once in each dataframe.

After removing duplicates, the code ensures that the `completion_count` variable is binary (i.e., either 0 or 1). This is done using the `.apply()` function with a lambda expression. The lambda function assigns a 1 if `completion_count` is greater than 0 (indicating a completion) and assigns a 0 if it is 0 or less (indicating no completion). This step ensures that the variable is clean and standardized for the analysis.
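
A minimal sketch of this cleaning step, shown for the Test Group only and on invented toy data (the same calls would apply to `control_df`):

```python
import pandas as pd

# Toy data (invented): the first two rows are duplicates of the same pair.
test_df = pd.DataFrame({
    "client_id":        [1, 1, 2, 3],
    "visitor_id":       ["v1", "v1", "v2", "v3"],
    "completion_count": [2, 2, 0, 1],
})

# Keep one row per (client_id, visitor_id) pair.
test_df_cleaned = test_df.drop_duplicates(subset=["client_id", "visitor_id"]).copy()

# Make completion_count binary: 1 if the client completed at least once, else 0.
test_df_cleaned["completion_count"] = test_df_cleaned["completion_count"].apply(
    lambda x: 1 if x > 0 else 0
)
print(test_df_cleaned)
```

Making the column binary is what lets the later sum of `completion_count` be read as "number of clients who completed" rather than "number of completions".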

### 2. Calculate Completions and Sample Size

The code calculates the total number of completions for each group:
- `test_completions` is the sum of the `completion_count` column in the `test_df_cleaned` dataframe (i.e., the total number of completions in the Test Group).
- `control_completions` is the sum of the `completion_count` column in the `control_df_cleaned` dataframe (i.e., the total number of completions in the Control Group).

The sample size for each group is also calculated:
- `test_size` is the number of rows (data points) in the cleaned test dataframe (`test_df_cleaned`).
- `control_size` is the number of rows in the cleaned control dataframe (`control_df_cleaned`).

### 3. Calculate Completion Rates

The completion rate for each group is calculated by dividing the number of completions by the total sample size for that group:
- `test_completion_rate` is calculated by dividing `test_completions` by `test_size`.
- `control_completion_rate` is calculated by dividing `control_completions` by `control_size`.

The calculated completion rates are printed as percentages (by multiplying by 100 and rounding to two decimal places).

### 4. Check for Valid Completion Rates

The code checks whether either completion rate is exactly 0 or 1 (meaning that no participant, or every participant, in the group completed the action). This is important because such extreme rates can drive the standard error to 0 (for example, when both groups sit at the same extreme), which leaves the Z-score undefined.
If either group's completion rate is 0 or 1, the code prints a message indicating that the Z-test is skipped due to invalid completion rates.
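
Steps 2 to 4 can be sketched together as follows. The variable names follow the text; the two small dataframes are invented stand-ins for the cleaned data, and the flag name `rates_valid` is an assumption.

```python
import pandas as pd

# Invented stand-ins for the cleaned dataframes (binary completion_count).
test_df_cleaned = pd.DataFrame({"completion_count": [1, 1, 1, 0]})
control_df_cleaned = pd.DataFrame({"completion_count": [1, 0, 1, 0]})

# Step 2: completions and sample sizes.
test_completions = int(test_df_cleaned["completion_count"].sum())
control_completions = int(control_df_cleaned["completion_count"].sum())
test_size = len(test_df_cleaned)
control_size = len(control_df_cleaned)

# Step 3: completion rates.
test_completion_rate = test_completions / test_size
control_completion_rate = control_completions / control_size
print(f"Test completion rate: {test_completion_rate * 100:.2f}%")
print(f"Control completion rate: {control_completion_rate * 100:.2f}%")

# Step 4: rates of exactly 0 or 1 make the Z-test unreliable, so flag them.
rates_valid = all(0 < r < 1 for r in (test_completion_rate, control_completion_rate))
if not rates_valid:
    print("Skipping Z-test: a completion rate of exactly 0 or 1 was found.")
```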

### 5. Perform Z-Test for Proportions

Once we know the completion rates for both the Test Group and the Control Group, we check if the difference between them is big enough to be **statistically significant** (i.e., it’s not just due to random chance).

Here’s how it works:

1. **Combine the two groups**: First, we calculate the overall completion rate for both groups combined (this is called the "pooled proportion").

2. **Calculate the "Standard Error" (SE)**: This tells us how much the difference in completion rates would vary from sample to sample if there were no real difference. It’s like measuring the "noise" in the data. The SE gets smaller as the groups get larger, and it is largest when the pooled completion rate is close to 50%.

3. **Check if the SE is valid**: If the SE can’t be calculated (which might happen with extreme data), we skip the test and show an error.

4. **Calculate the Z-score**: This is the key number we use to see if the difference between the two groups is big enough to be important. The Z-score is how many times bigger the difference is compared to the SE.

5. **Find the p-value**: This tells us how likely it would be to see a difference at least this large if there were really no difference between the groups. A smaller p-value means the observed difference is harder to explain by randomness alone. If the p-value is less than 0.05 (5%), we say the difference is **statistically significant**.


In short, the Z-test checks if the difference between the groups is big enough to be confident that it’s not just random. If it is, we can say the difference is "real".

### 6. Check Statistical Significance

The code checks if the p-value is smaller than the chosen significance level (alpha = 0.05, which corresponds to a 95% confidence level). If the p-value is less than 0.05, the difference between the completion rates is considered statistically significant, and the code prints a message indicating that. Otherwise, it prints that the difference is not statistically significant.
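
The test and the significance check can be sketched with the standard library alone. The function name `two_proportion_z_test` and the two-sided alternative are assumptions; the inputs are the Day 4 counts (18,682 of 26,961 for Test and 15,428 of 23,526 for Control), which may differ from the cleaned dataframes used for the reported A/B test, so the output is not expected to reproduce the reported Z-score exactly.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion Z-test; returns (z, two-sided p-value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion: overall completion rate of both groups combined.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("Standard error is 0; the Z-test is undefined.")
    z = (p_a - p_b) / se
    # Two-sided p-value, 2 * (1 - Phi(|z|)), written with erfc so that
    # very small tail probabilities do not round to exactly 0.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

alpha = 0.05
z, p_value = two_proportion_z_test(18682, 26961, 15428, 23526)
print(f"Z-score: {z:.2f}, p-value: {p_value:.3g}")
if p_value < alpha:
    print("The difference in completion rates is statistically significant.")
else:
    print("The difference in completion rates is not statistically significant.")
```

Using `math.erfc` instead of `1 - cdf(z)` avoids the floating-point underflow that makes very small p-values print as exactly zero.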



![alt text](../../visualizations/abtest_score_zvalue_pvalue.png)

![alt text](../../visualizations/abtest_completion_rates.png)

### **Conclusions**

#### Completion Rate Comparison:
- **Test Group**: 69.77% completion rate
- **Control Group**: 65.90% completion rate

The **test group** (with the new design) has a higher completion rate than the **control group** (with the old design). This suggests the new design works better at getting people to finish.
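
To relate these rates to the 5% improvement requirement stated at the start of this section, the lift can be computed in two ways. The requirement does not say whether 5% means percentage points or a relative increase, so the sketch below shows both; the variable names are illustrative.

```python
# Completion rates reported above, in percent.
test_rate = 69.77
control_rate = 65.90

# Absolute lift: difference in percentage points.
absolute_lift_pp = test_rate - control_rate

# Relative lift: increase as a percentage of the control rate.
relative_lift_pct = absolute_lift_pp / control_rate * 100

print(f"Absolute lift: {absolute_lift_pp:.2f} percentage points")
print(f"Relative lift: {relative_lift_pct:.2f}%")
```

Which of the two figures should be compared against the 5% threshold depends on how the requirement is defined, so that definition is worth confirming before drawing a conclusion.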

#### Statistical Significance:
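- **Z-score**: 9.82 (far above the roughly 1.96 needed for a two-sided test at the 5% level)
- **P-value**: < 0.0001 (displayed as 0.0000 after rounding; way smaller than 0.05)

The **Z-score** is really high, showing that the difference between the two groups is large relative to the standard error. The **p-value** is super small, which means the difference is very unlikely to be **due to chance** alone.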

#### What We Believe:
- The idea that **there’s no difference** (null hypothesis) is **rejected**.
- We accept that the **test group** does better (alternative hypothesis).

#### What This Means in Practice:
- Since the test group performed better, the new design seems like a **better option** than the old one.
- The difference is so big that it’s **probably not random**. The result is solid.

#### Confidence Interval:
- The 95% confidence interval for the difference in completion rates does not include zero, which is consistent with the Z-test result.
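
The interval itself is not reported above. As a sketch of how one could be computed, the hypothetical helper `diff_confidence_interval` below uses the normal approximation with an unpooled standard error, applied to the Day 4 counts (which may differ from the cleaned data used for the A/B test).

```python
import math
from statistics import NormalDist

def diff_confidence_interval(successes_a, n_a, successes_b, n_b, level=0.95):
    """Normal-approximation CI for the difference of two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Unpooled standard error of the difference.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # Critical value, about 1.96 for a 95% interval.
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p_a - p_b
    return diff - z_crit * se, diff + z_crit * se

# Day 4 counts: Test 18,682 of 26,961; Control 15,428 of 23,526.
lower, upper = diff_confidence_interval(18682, 26961, 15428, 23526)
print(f"95% CI for the difference: [{lower * 100:.2f}, {upper * 100:.2f}] percentage points")
```

If the lower bound stays above zero, the interval agrees with the Z-test that the Test Group's completion rate is higher.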

#### What to Do Next:
1. **Use the new design**: The new design works better, so consider using it.
2. **Check if it’s worth the cost**: See if the better completion rate is worth any extra costs.
3. **Test more**: Before rolling it out to everyone, keep testing to make sure it works well everywhere.

#### Final Takeaway:
- The test shows that the new design helps people complete more actions. It’s a **real difference**, so go ahead with it, as long as the benefits outweigh any extra costs.

