# Case Study: Navigating Vanguard's Digital Redesign

Note: This is one of many possible solutions and should serve as a guide.

Here's a brief overview of each dataset:

1. **df_demo**: Contains demographic details for each client, such as tenure, age, gender, number of accounts, balance, etc.
2. **df_experiment_clients**: Provides information on whether a client was exposed to the old ("Control") or new ("Test") digital process.
3. **df_web_data**: Captures the client's navigation through the digital process, detailing the steps they took.

Let's load the datasets and combine the split data files (df_final_web_data_pt_1 and df_final_web_data_pt_2).

In [None]:
import pandas as pd

path = 'https://github.com/data-bootcamp-v4/lessons/raw/main/6_inf_stats/project/files_for_project/'

# Correctly parsing the dataframes
df_demo = pd.read_csv(path+'df_final_demo.txt', sep=",")
df_experiment_clients = pd.read_csv(path+'df_final_experiment_clients.txt', sep=",")
df_web_data_pt_1 = pd.read_csv(path+'df_final_web_data_pt_1.txt', sep=",")
df_web_data_pt_2 = pd.read_csv(path+'df_final_web_data_pt_2.txt', sep=",")

# Combining web data
df_web_data = pd.concat([df_web_data_pt_1, df_web_data_pt_2], ignore_index=True)

## EDA and Data Cleaning

Let's start with the Exploratory Data Analysis (EDA). Whenever necessary, we will also do data cleaning.

### Duplicates, missing values and outliers

1. **Handling Duplicates**:
    - Check if there are any duplicate rows in each dataframe and remove them if necessary.

In [None]:
# 1. Handling Duplicates

# Check for duplicate rows in each dataframe
duplicates = {
    "df_demo": df_demo.duplicated().sum(),
    "df_experiment_clients": df_experiment_clients.duplicated().sum(),
    "df_web_data": df_web_data.duplicated().sum()
}

# If there are duplicates, we will drop them
if duplicates["df_demo"] > 0:
    df_demo.drop_duplicates(inplace=True)
if duplicates["df_experiment_clients"] > 0:
    df_experiment_clients.drop_duplicates(inplace=True)
if duplicates["df_web_data"] > 0:
    df_web_data.drop_duplicates(inplace=True)

duplicates # Before


In [None]:
{
    "df_demo": df_demo.duplicated().sum(),
    "df_experiment_clients": df_experiment_clients.duplicated().sum(),
    "df_web_data": df_web_data.duplicated().sum()
} # after

We observed that there were 10,764 duplicate rows in the `df_web_data` dataframe. These duplicates have been removed.
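
Before dropping duplicates, it can help to look at the repeated rows themselves. The sketch below uses a small made-up frame (`toy_web` and its values are hypothetical, not taken from the project files) to show the difference between counting the extra copies and listing every member of a duplicate group.

```python
import pandas as pd

# Toy frame standing in for df_web_data (values are made up for illustration)
toy_web = pd.DataFrame({
    "client_id": [1, 1, 2, 2, 2],
    "process_step": ["start", "start", "start", "step_1", "step_1"],
    "date_time": ["2017-04-01 10:00:00", "2017-04-01 10:00:00",
                  "2017-04-02 09:00:00", "2017-04-02 09:01:00",
                  "2017-04-02 09:01:00"],
})

# keep=False flags every member of a duplicate group, not only the repeats
all_copies = toy_web[toy_web.duplicated(keep=False)]

# The default keep="first" counts only the extra copies that would be dropped
n_extra = int(toy_web.duplicated().sum())

# Drop the extra copies and rebuild a clean 0..n-1 index
deduped = toy_web.drop_duplicates().reset_index(drop=True)

print(all_copies)
print(n_extra, len(deduped))
```

Inspecting `all_copies` first makes it easier to judge whether identical rows are logging artifacts or genuine repeated events.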


2. **Handling Missing Values**:
    - Address the missing values. Depending on the nature of the missing data, we can either:
        - Fill with mean, median, or mode (for numerical data).
        - Fill with a specific value or category.
        - Remove the rows with missing data (if the proportion of missing data is small).
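
The three options above can be sketched on a small made-up frame. The column names mirror `df_demo`, but the values and the `"Unknown"` placeholder are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with one missing numeric value and one missing category
toy = pd.DataFrame({
    "bal": [100.0, np.nan, 300.0, 10_000.0],
    "gendr": ["M", "F", None, "F"],
})

# Option 1: median for a numeric column, mode for a categorical one
filled = toy.copy()
filled["bal"] = filled["bal"].fillna(filled["bal"].median())
filled["gendr"] = filled["gendr"].fillna(filled["gendr"].mode()[0])

# Option 2: an explicit placeholder category
flagged = toy.copy()
flagged["gendr"] = flagged["gendr"].fillna("Unknown")

# Option 3: drop every row that has at least one missing value
dropped = toy.dropna()

print(filled)
print(flagged)
print(dropped)
```

The median is used rather than the mean in option 1 because a single very large balance would pull the mean upwards.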

Let's start by checking for missing values in each dataframe.

In [None]:
# Checking for missing values in each dataframe
missing_values = {
    "df_demo": df_demo.isnull().sum(),
    "df_experiment_clients": df_experiment_clients.isnull().sum(),
    "df_web_data": df_web_data.isnull().sum()
}

missing_values

Here are the findings regarding missing values:

1. **df_demo**:
   - `clnt_tenure_yr`, `clnt_tenure_mnth`, `gendr`, `num_accts`, `bal`, `calls_6_mnth`, and `logons_6_mnth` each have 14 missing values.
   - `clnt_age` has 15 missing values.

2. **df_experiment_clients**:
   - `Variation` has 20,109 missing values.

3. **df_web_data**:
   - There are no missing values in this dataframe.


Let's address missing values the following way:

1. For the `df_demo` dataframe, we'll fill missing values for numeric columns with their median and for the categorical column `gendr` with the mode.
2. For the `df_experiment_clients` dataframe, the missing values in the `Variation` column indicate clients who weren't part of the experiment. We'll leave them as they are for now.

In [None]:
# 2. Handling Missing Values

# For df_demo dataframe
numeric_cols_demo = ['clnt_tenure_yr', 'clnt_tenure_mnth', 'clnt_age', 'num_accts', 'bal', 'calls_6_mnth', 'logons_6_mnth']
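for col in numeric_cols_demo:
    # Assign the result back instead of calling fillna(inplace=True) on a column selection,
    # which recent pandas versions warn about and which may not update df_demo
    df_demo[col] = df_demo[col].fillna(df_demo[col].median())

df_demo['gendr'] = df_demo['gendr'].fillna(df_demo['gendr'].mode()[0])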

# Check if missing values are addressed
missing_values_updated = {
    "df_demo": df_demo.isnull().sum().max(),
    "df_experiment_clients": df_experiment_clients.isnull().sum().max(),
    "df_web_data": df_web_data.isnull().sum().max()
}

missing_values_updated


Missing values have been handled as planned:

- For the `df_demo` dataframe, numerical columns were filled with their respective medians, and the `gendr` column was filled with its mode.
- For the `df_experiment_clients` dataframe, the `Variation` column's missing values were left in place, so the check above still reports them. These rows correspond to clients who were not part of the experiment.
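
For clients without a `Variation` value, two simple treatments are possible: label them explicitly or keep only the participants. The following is a minimal sketch on made-up data; `toy_experiment` and the `"Not_Participated"` label are illustrative choices, not results from the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_experiment_clients (client IDs are made up)
toy_experiment = pd.DataFrame({
    "client_id": [101, 102, 103, 104],
    "Variation": ["Test", np.nan, "Control", np.nan],
})

# Treatment A: keep every client and make non-participation explicit
labelled = toy_experiment.assign(
    Variation=toy_experiment["Variation"].fillna("Not_Participated")
)

# Treatment B: keep only the clients who were assigned to a group
participants = toy_experiment.dropna(subset=["Variation"])

print(labelled)
print(participants)
```

Treatment B is usually the more convenient starting point for comparing Test and Control, while treatment A keeps the full client base available for descriptive work.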


    
3. **Addressing Outliers**:
    - Identify potential outliers in the key numerical columns and decide on an approach to handle them.


Let's identify and handle potential outliers in the key numerical columns of the `df_demo` dataframe. We'll use the IQR (Interquartile Range) method to detect outliers. The columns of primary interest for outlier detection include `clnt_age`, `bal`, `calls_6_mnth`, and `logons_6_mnth`.




In [None]:
# 3. Addressing Outliers

# Columns to check for outliers
cols_to_check = ['clnt_age', 'bal', 'calls_6_mnth', 'logons_6_mnth']

outliers = {}

# Detect outliers using IQR method
for col in cols_to_check:
    Q1 = df_demo[col].quantile(0.25)
    Q3 = df_demo[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers[col] = df_demo[(df_demo[col] < lower_bound) | (df_demo[col] > upper_bound)].shape[0]

outliers



The IQR method reveals potential outliers in the `bal` (balance) column, with 8,021 data points identified as outliers. The other columns, namely `clnt_age`, `calls_6_mnth`, and `logons_6_mnth`, do not have any outliers based on this method.

For the `bal` column:
- Given the nature of financial data, it's common to have a wide range of balances, with some clients having exceptionally high balances. 
- Therefore, instead of removing these outliers, it might be more appropriate to keep them, as they represent genuine data points.

However, it's essential to make a note of these outliers, as they can impact certain analyses, especially if the goal is to make generalizations or assumptions about the broader client base.
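
To see how much a single extreme balance can move a summary statistic, here is a small sketch with made-up balances (`toy_bal` is hypothetical). It applies the same 1.5 × IQR rule and compares the mean and median with and without the flagged value.

```python
import pandas as pd

# Five ordinary balances and one exceptionally high one (illustrative values)
toy_bal = pd.Series(
    [20_000, 35_000, 50_000, 65_000, 80_000, 5_000_000], name="bal"
)

# Same 1.5 * IQR rule as in the outlier check
q1, q3 = toy_bal.quantile(0.25), toy_bal.quantile(0.75)
iqr = q3 - q1
is_outlier = (toy_bal < q1 - 1.5 * iqr) | (toy_bal > q3 + 1.5 * iqr)

# Compare location statistics with and without the flagged rows
summary = pd.DataFrame(
    {
        "mean": [toy_bal.mean(), toy_bal[~is_outlier].mean()],
        "median": [toy_bal.median(), toy_bal[~is_outlier].median()],
    },
    index=["all clients", "without flagged outliers"],
)

print(summary)
```

In this toy case the mean changes by more than an order of magnitude while the median barely moves, which is why the median is often the safer summary for skewed financial data.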


### Summary Statistics

Next, let's generate summary statistics for the `df_demo` dataframe to understand the distribution of key variables. This will provide insights into measures like mean, median, standard deviation, and range for numerical variables.

In [None]:
# Descriptive statistics for numerical columns in df_demo
df_demo_descriptive = df_demo.describe()

df_demo_descriptive

The descriptive statistics for the numerical columns in the `df_demo` dataset reveal the following insights:

1. **client_id**: The `count` row shows 70,609 client records. The remaining statistics are not meaningful for an identifier column.
2. **clnt_tenure_yr** & **clnt_tenure_mnth**: The average tenure of clients is about 12 years (or approximately 151 months). The minimum tenure is 2 years (or 33 months) and the maximum is 62 years (or 749 months).
3. **clnt_age**: The average age of clients is approximately 46 years, with a minimum of 13.5 and a maximum of 96 years.
4. **num_accts**: Most clients have around 2 accounts, with a minimum of 1 and a maximum of 8 accounts.
5. **bal**: The average balance is about USD 147,445. However, the large standard deviation indicates a wide range of balances, from a minimum of USD 13,789 to a maximum of USD 16,320,040.
6. **calls_6_mnth**: On average, clients made around 3 calls in the last 6 months, with a maximum of 7 calls.
7. **logons_6_mnth**: Clients logged on to the platform an average of 5.6 times in the last 6 months, with a maximum of 9 logons.

### Distribution of clnt_age, bal, calls_6_mnth, and logons_6_mnth


Next, we can visualize the distribution of some of the key variables, such as `clnt_age`, `bal`, `calls_6_mnth`, and `logons_6_mnth`, to gain a better understanding of their spread and identify any outliers. 

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Setting up the figure and axes
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 10))
fig.suptitle('Distribution of Key Variables', fontsize=16)

# Plotting the distributions
sns.histplot(df_demo['clnt_age'], kde=True, ax=axs[0, 0]).set_title('Distribution of Client Age')
sns.histplot(df_demo['bal'], kde=True, ax=axs[0, 1]).set_title('Distribution of Balance')
sns.histplot(df_demo['calls_6_mnth'], kde=True, ax=axs[1, 0]).set_title('Distribution of Calls in 6 Months')
sns.histplot(df_demo['logons_6_mnth'], kde=True, ax=axs[1, 1]).set_title('Distribution of Logons in 6 Months')

# Adjusting layout
plt.tight_layout()
plt.subplots_adjust(top=0.90)
plt.show()


Given the data available, we can delve deeper into the following aspects:

1. **Gender Distribution**: Understand the distribution of genders among the clients.
2. **Tenure Distribution**: Investigate the distribution of client tenure in years and months.
3. **Relationship Analysis**: Examine the relationship between variables, such as the relationship between balance and client age, or the number of logons and calls in the last 6 months.
4. **Breakdown by Variation (Test vs. Control)**: Analyze the data distribution for clients in the Test group vs. the Control group from the `df_experiment_clients` dataframe.
5. **Web Data Analysis**: Explore the most common process steps clients go through and their frequency, based on the `df_web_data` dataframe.



### Gender Distribution

We'll start by visualizing the distribution of genders among the clients in the `df_demo` dataframe.


In [None]:
# Gender Distribution
gender_counts = df_demo['gendr'].value_counts()

# Plotting
plt.figure(figsize=(8, 6))
sns.barplot(x=gender_counts.index, y=gender_counts.values, palette="coolwarm")
plt.title('Gender Distribution', fontsize=15)
plt.xlabel('Gender')
plt.ylabel('Number of Clients')
plt.show()


The gender distribution visualization reveals the following:

- Clients with an unspecified or unknown gender (denoted as "U") form the largest group.
- The numbers of male (M) and female (F) clients are relatively balanced, with slightly more males.

### Tenure Distribution

Next, we'll visualize the distribution of client tenure in years (`clnt_tenure_yr`).

In [None]:
# Tenure Distribution in Years
plt.figure(figsize=(10, 6))
sns.histplot(df_demo['clnt_tenure_yr'], kde=True, bins=30, color="skyblue")
plt.title('Distribution of Client Tenure in Years', fontsize=15)
plt.xlabel('Tenure (Years)')
plt.ylabel('Number of Clients')
plt.show()


The distribution of client tenure in years reveals:

- A significant number of clients have a tenure between 2 and 10 years.
- The number of clients decreases as tenure increases, with fewer clients having tenures above 30 years.

### Relationship Analysis

Let's examine the relationship between balance (`bal`) and client age (`clnt_age`) using a scatter plot. This will help us understand if there's any visible correlation between the age of a client and their account balance.

We could look for relationships between many more variables but we won't get into that now.

In [None]:
# Scatter plot between balance and client age
plt.figure(figsize=(10, 6))
sns.scatterplot(x=df_demo['clnt_age'], y=df_demo['bal'], alpha=0.5, edgecolor=None)
plt.title('Relationship between Client Age and Balance', fontsize=15)
plt.xlabel('Client Age')
plt.ylabel('Balance')
plt.show()


From the scatter plot analyzing the relationship between client age and balance:

- There doesn't appear to be a strong linear correlation between client age and balance.
- Clients across all age groups have varying balances, with many having balances on the lower end.
- There are a few clients, regardless of age, with exceptionally high balances, visible as the scattered points on the upper end of the balance axis.
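
A visual impression like this can be backed with a number. The sketch below, on made-up data, computes the Pearson correlation and a rank-based (Spearman) correlation, the latter obtained here as the Pearson correlation of the ranks. The toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy ages and balances, including one very large balance
toy = pd.DataFrame({
    "clnt_age": [25, 35, 45, 55, 65, 75],
    "bal": [30_000, 45_000, 40_000, 90_000, 70_000, 2_000_000],
})

# Pearson measures linear association and is sensitive to extreme values
pearson = toy["clnt_age"].corr(toy["bal"])

# Spearman is the Pearson correlation of the ranks, so it only uses ordering
spearman = toy["clnt_age"].rank().corr(toy["bal"].rank())

print(round(pearson, 3), round(spearman, 3))
```

When balances are heavily skewed, reporting both coefficients gives a fuller picture than the scatter plot alone.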

### Breakdown by Variation (Test vs. Control)

Let's analyze the data distribution for clients in the Test group versus the Control group from the `df_experiment_clients` dataframe. We'll start by visualizing the distribution of clients in each group.


In [None]:
# Merging the experiment dataframe with the demo dataframe to get complete data for each group
df_demo_experiment_merged = pd.merge(df_demo, df_experiment_clients, on="client_id", how="left")

# Distribution of Test vs Control
variation_counts = df_demo_experiment_merged['Variation'].value_counts()

# Plotting
plt.figure(figsize=(8, 6))
sns.barplot(x=variation_counts.index, y=variation_counts.values, palette="viridis")
plt.title('Distribution of Clients in Test vs Control Group', fontsize=15)
plt.xlabel('Group')
plt.ylabel('Number of Clients')
plt.show()


From the visualization of the distribution of clients in the Test vs. Control group:

- There's a relatively balanced distribution between clients in the Test group and those in the Control group.
- The Test group has slightly more clients than the Control group.
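
Beyond group sizes, it is worth checking that the two groups look similar on demographic variables. Here is a minimal sketch on a made-up merged frame (`toy_merged` is hypothetical, with column names mirroring the merged dataframe):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged demo + experiment data
toy_merged = pd.DataFrame({
    "client_id": [1, 2, 3, 4, 5, 6],
    "clnt_age": [30.0, 50.0, 40.0, 60.0, 45.0, 55.0],
    "bal": [50_000.0, 150_000.0, 80_000.0, 120_000.0, 60_000.0, 90_000.0],
    "Variation": ["Test", "Test", "Control", "Control", np.nan, "Test"],
})

# Profile each group, ignoring clients with no group assignment
group_profile = (
    toy_merged
    .dropna(subset=["Variation"])
    .groupby("Variation")
    .agg(
        n_clients=("client_id", "nunique"),
        mean_age=("clnt_age", "mean"),
        median_bal=("bal", "median"),
    )
)

# Share of each group among assigned clients (NaN is excluded by default)
group_share = toy_merged["Variation"].value_counts(normalize=True)

print(group_profile)
print(group_share)
```

Large differences between the rows of `group_profile` would suggest that the assignment was not well balanced and that later comparisons need extra care.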

### Web Data Analysis

Let's explore the most common process steps clients go through based on the `df_web_data` dataframe. We'll visualize the frequency of each process step.



In [None]:
# Process step frequency
process_step_counts = df_web_data['process_step'].value_counts()

# Plotting
plt.figure(figsize=(10, 6))
sns.barplot(x=process_step_counts.index, y=process_step_counts.values, palette="magma")
plt.title('Frequency of Process Steps', fontsize=15)
plt.xlabel('Process Step')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()


From the visualization of the frequency of process steps:

- `start` appears to be the most frequent process step that clients go through.
- This is followed by `step_1` and `step_2`, respectively.
- The other steps (`step_3` and `confirm`) have lower frequencies.
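
Raw event counts include repeated visits to the same step, so a complementary view is the number of distinct clients who reach each step. The sketch below uses made-up events, and the ordering in `step_order` is an assumption about how the process flows.

```python
import pandas as pd

# Toy events: client 1 stops at step_2, client 2 at step_1, client 3 confirms
toy_web = pd.DataFrame({
    "client_id": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
    "process_step": ["start", "start", "step_1", "step_2",
                     "start", "step_1", "step_1",
                     "start", "step_1", "step_2", "step_3", "confirm"],
})

# Assumed order of the steps in the digital process
step_order = ["start", "step_1", "step_2", "step_3", "confirm"]

# Distinct clients per step, arranged in process order
clients_per_step = (
    toy_web.groupby("process_step")["client_id"]
    .nunique()
    .reindex(step_order, fill_value=0)
)

# Share of starting clients who reach each step
reach_rate = clients_per_step / clients_per_step["start"]

print(clients_per_step)
print(reach_rate)
```

Comparing this funnel between the Test and Control groups is one natural next step for evaluating the redesign.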

### More EDA

#### Outlier Detection

For example, let's visualize potential outliers in the balance column using a box plot.

In [None]:
sns.boxplot(x=df_demo['bal'])
plt.title('Box Plot of Balance')
plt.show()


#### Categorical Variable Exploration

Investigate the average balance across different genders.

In [None]:
sns.barplot(x='gendr', y='bal', data=df_demo)
plt.title('Average Balance by Gender')
plt.show()


#### Time Series Analysis

We can analyze trends over time using the `date_time` column from the `df_web_data` dataframe. Each row is a recorded process step rather than a login, so the plot below shows the monthly volume of web events:

In [None]:
df_web_data['date_time'] = pd.to_datetime(df_web_data['date_time'])
# Count rows (recorded process steps) per calendar month
df_web_data.groupby(df_web_data['date_time'].dt.to_period("M")).size().plot()
plt.title('Monthly Trend of Web Events')
plt.show()


#### Correlation Analysis

In [None]:
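# Restrict to numeric columns; the text column `gendr` would otherwise raise an error in pandas 2.x
correlation_matrix = df_demo.corr(numeric_only=True)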
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()


#### Segmentation Analysis

Segment clients by age and compare their average balance.

In [None]:
# Bins are left-closed (right=False): [20, 30), [30, 40), ..., [90, 100)
# Ages below 20 fall outside the bins, become NaN, and are left out of the plot
bins = [20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = ['20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-90', '90+']
df_demo['age_group'] = pd.cut(df_demo['clnt_age'], bins=bins, labels=labels, right=False)
sns.barplot(x='age_group', y='bal', data=df_demo)
plt.title('Average Balance by Age Group')
plt.show()
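
The way `pd.cut` treats boundary and out-of-range values is easy to overlook. The sketch below applies the same bins and labels to a handful of made-up ages chosen to sit on or beyond the bin edges.

```python
import pandas as pd

# Illustrative ages on and around the bin edges
ages = pd.Series([13.5, 20.0, 29.9, 30.0, 96.0, 100.0], name="clnt_age")

bins = [20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = ['20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-90', '90+']

# right=False makes each bin include its left edge and exclude its right edge
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

# Values outside every bin are assigned NaN rather than raising an error
unassigned = ages[age_group.isna()]

print(pd.DataFrame({"clnt_age": ages, "age_group": age_group}))
print(unassigned.tolist())
```

Checking `unassigned` on the real data shows how many clients are silently excluded from the segmented plot. Widening the first and last bins is one way to include them.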
