1. EDA and Data Cleaning
2. Performance Metrics
3. Hypothesis Testing
4. Experiment Evaluation
5. Tableau

* Goal: determine whether the new design leads to a better user experience and higher process completion rates.
* The critical question: would these changes encourage more clients to complete the process?
* A/B testing: a Control group and a Test group.
* Both groups follow an identical process sequence: an initial page, three subsequent steps, and finally a confirmation page signaling process completion.
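
The completion rate mentioned above can be sketched as the share of clients in each group who reach the confirmation page. This is only an illustration on made-up data: the column names (`client_id`, `group`, `step`) and the step labels are assumptions, not the actual schema of the web data.

```python
import pandas as pd

# Hypothetical event log: one row per client per step reached.
# Column names and step labels are illustrative assumptions.
events = pd.DataFrame({
    "client_id": [1] * 5 + [2] * 2 + [3] * 5 + [4],
    "group": ["Control"] * 7 + ["Test"] * 6,
    "step": ["start", "step_1", "step_2", "step_3", "confirm",
             "start", "step_1",
             "start", "step_1", "step_2", "step_3", "confirm",
             "start"],
})

# A client counts as completed if they reached the confirmation page at least once.
completed = (
    events.assign(is_confirm=events["step"].eq("confirm"))
          .groupby(["group", "client_id"])["is_confirm"]
          .any()
)

# Share of clients per group who completed the process.
completion_rate = completed.groupby(level="group").mean()
print(completion_rate)
```

Counting each client once, rather than each page view, keeps repeat visits from inflating the rate.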


1. Client Profiles (df_final_demo): Demographics like age, gender, and account details of our clients.
2. Digital Footprints (df_final_web_data): A detailed trace of client interactions online, divided into two parts: pt_1 and pt_2. It’s recommended to merge these two files prior to a comprehensive data analysis.
3. Experiment Roster (df_final_experiment_clients): A list revealing which clients were part of the grand experiment.
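
Combining the two web-data parts could look roughly like the sketch below. It assumes both parts share the same columns, so a row-wise concatenation is what is wanted rather than a key-based join. The small frames and their column names are stand-ins for illustration only.

```python
import pandas as pd

# Stand-ins for the two parts. In practice each would be loaded with
# pd.read_csv(...); the exact file names are not assumed here.
web_pt_1 = pd.DataFrame({"client_id": [1, 2], "process_step": ["start", "step_1"]})
web_pt_2 = pd.DataFrame({"client_id": [3], "process_step": ["confirm"]})

# Stack the parts row-wise and rebuild a clean 0..n-1 index.
web = pd.concat([web_pt_1, web_pt_2], ignore_index=True)

# Sanity check: no rows were lost or duplicated by the concatenation.
assert len(web) == len(web_pt_1) + len(web_pt_2)
```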

* Gender: 'U' means undisclosed and 'X' means unspecified.

The primary objective is to decode the experiment’s performance.

#### Import necessary libraries

In [17]:
import pandas as pd
import numpy as np
import scipy as sc

#### Load and clean the 'client profile' dataset

In [4]:
demog = pd.read_csv("df_final_demo.csv")
demog.head()

Unnamed: 0,client_id,clnt_tenure_yr,clnt_tenure_mnth,clnt_age,gendr,num_accts,bal,calls_6_mnth,logons_6_mnth
0,836976,6.0,73.0,60.5,U,2.0,45105.3,6.0,9.0
1,2304905,7.0,94.0,58.0,U,2.0,110860.3,6.0,9.0
2,1439522,5.0,64.0,32.0,U,2.0,52467.79,6.0,9.0
3,1562045,16.0,198.0,49.0,M,2.0,67454.65,3.0,6.0
4,5126305,12.0,145.0,33.0,F,2.0,103671.75,0.0,3.0


In [5]:
demog.tail()

Unnamed: 0,client_id,clnt_tenure_yr,clnt_tenure_mnth,clnt_age,gendr,num_accts,bal,calls_6_mnth,logons_6_mnth
70604,7993686,4.0,56.0,38.5,U,3.0,1411062.68,5.0,5.0
70605,8981690,12.0,148.0,31.0,M,2.0,101867.07,6.0,6.0
70606,333913,16.0,198.0,61.5,F,2.0,40745.0,3.0,3.0
70607,1573142,21.0,255.0,68.0,M,3.0,475114.69,4.0,4.0
70608,5602139,21.0,254.0,59.5,F,3.0,157498.73,7.0,7.0


In [6]:
demog.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70609 entries, 0 to 70608
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   client_id         70609 non-null  int64  
 1   clnt_tenure_yr    70595 non-null  float64
 2   clnt_tenure_mnth  70595 non-null  float64
 3   clnt_age          70594 non-null  float64
 4   gendr             70595 non-null  object 
 5   num_accts         70595 non-null  float64
 6   bal               70595 non-null  float64
 7   calls_6_mnth      70595 non-null  float64
 8   logons_6_mnth     70595 non-null  float64
dtypes: float64(7), int64(1), object(1)
memory usage: 4.8+ MB


In [11]:
demog.duplicated().sum()

0

In [7]:
demog.isnull().sum()

client_id            0
clnt_tenure_yr      14
clnt_tenure_mnth    14
clnt_age            15
gendr               14
num_accts           14
bal                 14
calls_6_mnth        14
logons_6_mnth       14
dtype: int64

First, round the age column to whole years and handle the null values in each column, then convert the float columns to integers. Because rows with nulls are a tiny fraction of the dataset (15 of 70,609 rows), it is reasonable to drop them all at once.
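
Before dropping rows, it can help to measure how much of the data is affected. The sketch below does this on a small toy frame (the values are illustrative, not taken from the real file) and then applies the same round, drop, cast sequence.

```python
import numpy as np
import pandas as pd

# Toy data with the same kind of gaps as the client profile table.
toy = pd.DataFrame({
    "client_id": [1, 2, 3, 4],
    "clnt_age": [60.5, 58.0, np.nan, 33.0],
    "bal": [45105.3, 110860.3, 52467.79, np.nan],
})

# Fraction of rows with at least one missing value.
null_share = toy.isnull().any(axis=1).mean()

# Drop incomplete rows, round age to whole years, then cast to integers.
clean = toy.dropna().copy()
clean["clnt_age"] = clean["clnt_age"].round(0)
clean = clean.astype({"clnt_age": int, "bal": int})

print(f"{null_share:.1%} of rows contain nulls")
print(clean)
```

Note that `astype(int)` truncates the fractional part, so balances lose their cents rather than being rounded.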

In [18]:
# Round age to whole years (halves round to the nearest even integer, e.g. 60.5 -> 60)
demog['clnt_age'] = demog['clnt_age'].round(0)

In [22]:
demog.dropna(inplace = True)

In [23]:
# Check the dataset for remaining null values
demog.isnull().sum()

client_id           0
clnt_tenure_yr      0
clnt_tenure_mnth    0
clnt_age            0
gendr               0
num_accts           0
bal                 0
calls_6_mnth        0
logons_6_mnth       0
dtype: int64

In [24]:
# Cast the float columns to integers (astype(int) truncates, e.g. bal 52467.79 -> 52467)
num_cols = ['clnt_tenure_yr', 'clnt_tenure_mnth', 'clnt_age', 'num_accts', 'bal', 'calls_6_mnth', 'logons_6_mnth']
demog[num_cols] = demog[num_cols].astype(int)

Now let us check the whole dataset and save it as a CSV file for EDA.

In [25]:
demog.head()

Unnamed: 0,client_id,clnt_tenure_yr,clnt_tenure_mnth,clnt_age,gendr,num_accts,bal,calls_6_mnth,logons_6_mnth
0,836976,6,73,60,U,2,45105,6,9
1,2304905,7,94,58,U,2,110860,6,9
2,1439522,5,64,32,U,2,52467,6,9
3,1562045,16,198,49,M,2,67454,3,6
4,5126305,12,145,33,F,2,103671,0,3


In [27]:
demog.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 70594 entries, 0 to 70608
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   client_id         70594 non-null  int64 
 1   clnt_tenure_yr    70594 non-null  int32 
 2   clnt_tenure_mnth  70594 non-null  int32 
 3   clnt_age          70594 non-null  int32 
 4   gendr             70594 non-null  object
 5   num_accts         70594 non-null  int32 
 6   bal               70594 non-null  int32 
 7   calls_6_mnth      70594 non-null  int32 
 8   logons_6_mnth     70594 non-null  int32 
dtypes: int32(7), int64(1), object(1)
memory usage: 3.5+ MB


In [28]:
# Save the clean data to csv format
demog.to_csv('demog.csv')
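
One thing to keep in mind when reloading the saved file: `to_csv` writes the DataFrame index by default, and `read_csv` then reads it back as an extra `Unnamed: 0` column. The sketch below shows the difference on a tiny frame, using an in-memory buffer instead of a file. Whether to pass `index=False` depends on how the file is consumed later, so treat this as an option rather than a required change.

```python
import io
import pandas as pd

df = pd.DataFrame({"client_id": [836976, 2304905], "clnt_age": [60, 58]})

# Default behaviour: the index is written as an unnamed first column.
buf_default = io.StringIO()
df.to_csv(buf_default)
buf_default.seek(0)
reloaded_default = pd.read_csv(buf_default)

# With index=False only the data columns are written.
buf_no_index = io.StringIO()
df.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
reloaded_no_index = pd.read_csv(buf_no_index)

print(list(reloaded_default.columns))
print(list(reloaded_no_index.columns))
```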

#### Load and clean the 'df_final_web_data' dataset