In the previous notebook I carried out an initial analysis of the dataset, including the characteristics of the clients.
In this notebook, I focus on analyzing the results of the A/B test: preparing the data, designing the metrics, and setting up the statistical tests (hypothesis testing) used to verify the results.

# Importing libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import numpy as np
import datetime

import scipy.stats as st
from scipy.stats import chi2_contingency
from scipy.stats.contingency import association
from scipy.stats import norm
import scipy.stats as stats

from functools import reduce

# Loading datasets

Importing the cleaned datasets. For the trace dataset, I make sure that the date_time column is parsed as datetime.

In [2]:
df_clients = pd.read_csv('../data/df_clients_CLEANED.csv')
df_trace = pd.read_csv('../data/df_trace_CLEANED.csv', parse_dates=['date_time'])
df_roster = pd.read_csv('../data/df_roster_CLEANED.csv')

# Data preparation

To begin with, I will add a new column to df_trace that identifies which group (Control or Test) each client belongs to.

In [3]:
# first, making a dictionary to use for mapping for the Test and Control group
dict_client_group = df_roster[df_roster.variation.isin(['Test','Control'])].set_index('client_id')['variation'].to_dict()

df_trace['group'] = df_trace['client_id'].map(dict_client_group)

In [4]:
# dropping the traces of the clients that belong to neither of the two groups
df_trace.dropna(subset=['group'], inplace=True)
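To illustrate what the mapping and the drop do, here is a minimal sketch on hypothetical toy data (the names `toy_roster`, `toy_trace` and `toy_map` are made up for this example and are not part of the real datasets). Clients that are missing from the dictionary get NaN in the group column and are then removed.

```python
import pandas as pd

# Hypothetical toy data: client 3 has no assigned variation, client 4 is not in the roster
toy_roster = pd.DataFrame({
    'client_id': [1, 2, 3],
    'variation': ['Test', 'Control', None],
})
toy_trace = pd.DataFrame({'client_id': [1, 1, 2, 3, 4]})

# Keep only the clients assigned to one of the two groups
toy_map = (
    toy_roster[toy_roster['variation'].isin(['Test', 'Control'])]
    .set_index('client_id')['variation']
    .to_dict()
)

# Clients missing from the dictionary (3 and 4) are mapped to NaN
toy_trace['group'] = toy_trace['client_id'].map(toy_map)

# Dropping the NaN rows keeps only the traces of Test and Control clients
toy_trace = toy_trace.dropna(subset=['group'])
print(toy_trace)
```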

For the A/B testing, I will be using the visit_id to identify the individual sessions, and analyse each visit.

In [5]:
df_trace.head()

Unnamed: 0,client_id,visitor_id,visit_id,process_step,date_time,group
0,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:27:07,Test
1,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:26:51,Test
2,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:19:22,Test
3,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:19:13,Test
4,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:18:04,Test


In [6]:
# dropping the visitor_id column, because it will not be used in this analysis
df_trace.drop(columns='visitor_id', inplace=True)

# renaming the process_step column for brevity
df_trace.rename(columns={'process_step': 'step'}, inplace=True)

# renaming the step names with numbers for analysis purposes (will do subtraction later)
df_trace['step']  = df_trace['step'].map({'start':1,'step_1':2,'step_2':3,'step_3':4,'confirm':5})

# Sorting the dataframe
df_trace.sort_values(by=['visit_id','date_time'], ascending=True, inplace=True)

# Resetting index
df_trace = df_trace.reset_index(drop=True)

In [7]:
df_trace.head()

Unnamed: 0,client_id,visit_id,step,date_time,group
0,3561384,100012776_37918976071_457913,5,2017-04-26 13:22:17,Test
1,3561384,100012776_37918976071_457913,5,2017-04-26 13:23:09,Test
2,7338123,100019538_17884295066_43909,1,2017-04-09 16:20:56,Test
3,7338123,100019538_17884295066_43909,2,2017-04-09 16:21:12,Test
4,7338123,100019538_17884295066_43909,3,2017-04-09 16:21:21,Test


# Metrics for evaluating the A/B test

Now I am going to start looking into the separate visits. I need to understand the following:
- is the process completed? (meaning, has the visit reached the 'confirm' step?)
- how long does it take to complete the process? (taking into account the total duration of the visit)
- how much time is spent on each step?
- how many errors are there per session, in the form of steps back?
- how many errors are there, in the form of consecutive records of the same step?

With these metrics I can compare the completion rates of the Control and the Test group. The time and error metrics also let me assess potential user confusion or system issues in the new UI.

First, I will create columns that will help me with the comparisons I need to make. Here I will assess errors related to the steps.

In [8]:
# helper column to track the previous step within the same visit
df_trace['prev_step'] = df_trace.groupby('visit_id')['step'].shift(1)

# here I will add a column where I subtract the previous step, to be able to count the steps back
df_trace['step_back'] = (df_trace['step'] - df_trace['prev_step'] < 0).astype(int)

# here I am checking if there are multiple records for the same step
df_trace['step_repeat'] = (df_trace['step'] - df_trace['prev_step'] == 0).astype(int)

# Dropping the guide columns
df_trace.drop(columns='prev_step', inplace=True)
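As a quick check of the logic, here is a small sketch on two hypothetical visits (the dataframe `toy` and its values are invented for illustration). The first record of each visit has no previous step, so the difference is NaN and neither flag is set.

```python
import pandas as pd

# Hypothetical visits: 'a' goes 1 -> 2 -> 3 -> 2 -> 3 -> 3 -> 4, 'b' goes 1 -> 1 -> 2
toy = pd.DataFrame({
    'visit_id': ['a'] * 7 + ['b'] * 3,
    'step': [1, 2, 3, 2, 3, 3, 4, 1, 1, 2],
})

# Previous step within the same visit (NaN on the first record of each visit)
toy['prev_step'] = toy.groupby('visit_id')['step'].shift(1)
diff = toy['step'] - toy['prev_step']

# Comparisons with NaN evaluate to False, so first records are never flagged
toy['step_back'] = (diff < 0).astype(int)      # moved to an earlier step
toy['step_repeat'] = (diff == 0).astype(int)   # same step recorded twice in a row

print(toy.groupby('visit_id')[['step_back', 'step_repeat']].sum())
```

Note that a step back is flagged on the record of the step the user lands on, not on the step they left.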

Here I am assessing time metrics.

In [9]:
# helper column to track the timestamp of the next record within the same visit
df_trace['next_time'] = df_trace.groupby('visit_id')['date_time'].shift(-1)

# creating also columns with the time difference
df_trace['time_on_step'] = df_trace['next_time'] - df_trace['date_time']

# the above created timedelta values, here I am transforming them to seconds, to make comparison easier
df_trace['time_on_step']  = df_trace['time_on_step'].dt.total_seconds()

# Dropping the guide columns
df_trace.drop(columns='next_time', inplace=True)
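The sketch below reproduces the time calculation on a tiny example. The three timestamps of visit 'a' are taken from the records shown earlier in this notebook, while visit 'b' is a hypothetical single-record visit. The last record of every visit has no next record, so its time_on_step is NaN, and a visit with only one record ends up with a total duration of 0 once the values are summed.

```python
import pandas as pd

toy = pd.DataFrame({
    'visit_id': ['a', 'a', 'a', 'b'],
    'date_time': pd.to_datetime([
        '2017-04-09 16:20:56', '2017-04-09 16:21:12', '2017-04-09 16:21:21',
        '2017-04-10 09:00:00',  # hypothetical visit with a single record
    ]),
})

# Timestamp of the next record in the same visit (NaT on the last record)
next_time = toy.groupby('visit_id')['date_time'].shift(-1)

# Seconds spent before the next record was logged
toy['time_on_step'] = (next_time - toy['date_time']).dt.total_seconds()

# Summing skips NaN, so the single-record visit gets a duration of 0.0
total_duration = toy.groupby('visit_id')['time_on_step'].sum()
print(toy)
print(total_duration)
```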

In [10]:
df_trace.head()

Unnamed: 0,client_id,visit_id,step,date_time,group,step_back,step_repeat,time_on_step
0,3561384,100012776_37918976071_457913,5,2017-04-26 13:22:17,Test,0,0,52.0
1,3561384,100012776_37918976071_457913,5,2017-04-26 13:23:09,Test,0,1,
2,7338123,100019538_17884295066_43909,1,2017-04-09 16:20:56,Test,0,0,16.0
3,7338123,100019538_17884295066_43909,2,2017-04-09 16:21:12,Test,0,0,9.0
4,7338123,100019538_17884295066_43909,3,2017-04-09 16:21:21,Test,0,0,14.0


**Metric: completed session**

Here I will find which visit_ids have reached the 'confirm' step (step 5). Note that the validity of these completed sessions is checked in the next section of the notebook.

In [11]:
print(f"Out of the total {df_trace.visit_id.nunique()} sessions, {sum(df_trace.groupby('visit_id')['step'].max() == 5)} have reached the 'confirm' step.")

Out of the total 69205 sessions, 37680 have reached the 'confirm' step.


In [12]:
# Calculating the maximum step that a visit reached
max_step = df_trace.groupby('visit_id')['step'].max()

# Adding a column to identify the visits that reached confirm
df_trace['reached_step_5'] = df_trace['visit_id'].map(max_step == 5).astype(int)

**Invalid completed sessions**

An important check on the completed sessions is whether they are valid. For a completed session to be valid, the visit needs to have gone through all of steps 1, 2, 3 and 4.

In [13]:
# Checking per visit, if the visit contains all of steps 1, 2, 3 and 4
contains_steps_1234 = df_trace.groupby('visit_id')['step'].apply(lambda x: set([1,2, 3, 4]).issubset(x))

df_trace['reached_steps_1234'] = df_trace['visit_id'].map(contains_steps_1234).fillna(False).astype(int)
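To make the validity rule concrete, here is a minimal sketch on three hypothetical visits (the names `toy`, `required_steps` and `summary` are assumptions made for this example). Visit 'a' passes through every step, visit 'b' only has 'confirm' records, and visit 'c' stops at step 3.

```python
import pandas as pd

# Hypothetical visits: 'a' is a full run, 'b' jumps straight to confirm, 'c' drops out
toy = pd.DataFrame({
    'visit_id': ['a'] * 5 + ['b'] * 2 + ['c'] * 3,
    'step': [1, 2, 3, 4, 5, 5, 5, 1, 2, 3],
})

required_steps = {1, 2, 3, 4}
grouped = toy.groupby('visit_id')['step']

summary = pd.DataFrame({
    # did the visit reach the 'confirm' step?
    'reached_confirm': grouped.max() == 5,
    # did the visit include every step before 'confirm'?
    'has_all_steps': grouped.apply(lambda s: required_steps.issubset(set(s))).astype(bool),
})

summary['completed'] = summary['reached_confirm'] & summary['has_all_steps']
summary['completed_but_invalid'] = summary['reached_confirm'] & ~summary['has_all_steps']
print(summary)
```

Only visit 'a' counts as completed here, while 'b' is flagged as completed but invalid.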

In [14]:
df_trace.head()

Unnamed: 0,client_id,visit_id,step,date_time,group,step_back,step_repeat,time_on_step,reached_step_5,reached_steps_1234
0,3561384,100012776_37918976071_457913,5,2017-04-26 13:22:17,Test,0,0,52.0,1,0
1,3561384,100012776_37918976071_457913,5,2017-04-26 13:23:09,Test,0,1,,1,0
2,7338123,100019538_17884295066_43909,1,2017-04-09 16:20:56,Test,0,0,16.0,1,1
3,7338123,100019538_17884295066_43909,2,2017-04-09 16:21:12,Test,0,0,9.0,1,1
4,7338123,100019538_17884295066_43909,3,2017-04-09 16:21:21,Test,0,0,14.0,1,1


# Final dataframes setup: visits info, steps duration, repeated steps, steps back

First, I am making a dataframe that summarizes all the metrics I want to check per session (except for the step-level error info).

In [15]:
df_visits_results = df_trace.groupby('visit_id').agg({
    'client_id':'first', 'date_time':'first', 'group':'first', 'time_on_step':'sum', 
    'step_back':'sum', 'step_repeat':'sum',
    'reached_step_5':'max', 'reached_steps_1234':'max'
    }).reset_index().rename(columns={
    'time_on_step':'total_visit_duration','step_back':'total_steps_back', 'step_repeat':'total_steps_repeat'
})

In [16]:
df_visits_results['completed'] = (
    (df_visits_results['reached_step_5'] == 1) & (df_visits_results['reached_steps_1234'] == 1)
).astype(int)


df_visits_results['completed_but_invalid'] = (
    (df_visits_results['reached_step_5'] == 1) & (df_visits_results['reached_steps_1234'] == 0)
).astype(int)

# dropping the columns for the specific steps, since now I created the results of completed/ valid visits
df_visits_results.drop(columns=['reached_step_5','reached_steps_1234'], inplace=True)


# removing the time component, keeping only the date of the first record of the visit
df_visits_results['date_time'] = df_visits_results['date_time'].dt.normalize()
df_visits_results.rename(columns={'date_time':'date'}, inplace=True)

df_visits_results.head()

Unnamed: 0,visit_id,client_id,date,group,total_visit_duration,total_steps_back,total_steps_repeat,completed,completed_but_invalid
0,100012776_37918976071_457913,3561384,2017-04-26,Test,52.0,0,1,0,1
1,100019538_17884295066_43909,7338123,2017-04-09,Test,242.0,2,2,1,0
2,100022086_87870757897_149620,2478628,2017-05-23,Test,180.0,0,0,1,0
3,100030127_47967100085_936361,105007,2017-03-22,Control,0.0,0,0,0,0
4,100037962_47432393712_705583,5623007,2017-04-14,Control,132.0,1,1,0,0
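With this dataframe, the completion rate per group is just the mean of the `completed` flag within each group. The sketch below only shows the mechanics: it uses the five rows printed above, so the resulting numbers are not representative of the full dataset (`head_rows` and `rates` are names introduced for this example).

```python
import pandas as pd

# The five rows shown in the output above (group and completion flags only)
head_rows = pd.DataFrame({
    'group': ['Test', 'Test', 'Test', 'Control', 'Control'],
    'completed': [0, 1, 1, 0, 0],
    'completed_but_invalid': [1, 0, 0, 0, 0],
})

# The mean of a 0/1 flag is the share of visits where the flag is set
rates = head_rows.groupby('group')[['completed', 'completed_but_invalid']].mean()
print(rates)
```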


Now I will make another dataframe that holds the main step-level information per group. I am starting with the mean step duration and the number of records per step.

In [17]:
df_steps_results = df_trace.groupby(['group', 'step']).agg({
    'time_on_step': 'mean',  # mean time spent on each step
    'date_time': 'size'     # occurrences of each step
}).reset_index()

# renaming columns for clarity
df_steps_results.rename(columns={
    'time_on_step': 'mean_time_on_step',
    'date_time': 'mean_time_step_count'
}, inplace=True)

df_steps_results

Unnamed: 0,group,step,mean_time_on_step,mean_time_step_count
0,Control,1,66.804757,45380
1,Control,2,50.535583,29544
2,Control,3,92.043223,25773
3,Control,4,137.224496,22503
4,Control,5,180.146051,17336
5,Test,1,61.454595,55773
6,Test,2,60.756987,38666
7,Test,3,88.876116,30899
8,Test,4,129.607488,25761
9,Test,5,250.566239,25600


Now, I will add to the above dataframe the information regarding consecutive step repetitions.

In [18]:
df_step_repeats = df_trace.groupby(['group','step'])['step_repeat'].sum().reset_index()

df_steps_results = pd.merge(df_steps_results, df_step_repeats, on=['group', 'step'])

df_steps_results

Unnamed: 0,group,step,mean_time_on_step,mean_time_step_count,step_repeat
0,Control,1,66.804757,45380,9664
1,Control,2,50.535583,29544,1186
2,Control,3,92.043223,25773,839
3,Control,4,137.224496,22503,750
4,Control,5,180.146051,17336,1239
5,Test,1,61.454595,55773,12206
6,Test,2,60.756987,38666,1573
7,Test,3,88.876116,30899,581
8,Test,4,129.607488,25761,876
9,Test,5,250.566239,25600,3790


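Lastly, I am adding the info regarding the steps back. Note that a step back is counted on the step the user returned to, which is why step 5 has no steps back.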

In [19]:
df_step_back = df_trace.groupby(['group','step'])['step_back'].sum().reset_index()

df_steps_results = pd.merge(df_steps_results, df_step_back, on=['group', 'step'])

df_steps_results

Unnamed: 0,group,step,mean_time_on_step,mean_time_step_count,step_repeat,step_back
0,Control,1,66.804757,45380,9664,4932
1,Control,2,50.535583,29544,1186,2303
2,Control,3,92.043223,25773,839,2364
3,Control,4,137.224496,22503,750,101
4,Control,5,180.146051,17336,1239,0
5,Test,1,61.454595,55773,12206,10621
6,Test,2,60.756987,38666,1573,3414
7,Test,3,88.876116,30899,581,2283
8,Test,4,129.607488,25761,876,22
9,Test,5,250.566239,25600,3790,0
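Because the two groups have a different number of records per step, the raw counts above are not directly comparable. One possible way to put them on the same scale, shown here as a sketch rather than as part of the original analysis, is to divide each count by the number of records of that step. The values below are copied from the table above, while the column names `repeat_rate` and `back_rate` are introduced for this example.

```python
import pandas as pd

# Values copied from the df_steps_results table above
steps = pd.DataFrame({
    'group': ['Control'] * 5 + ['Test'] * 5,
    'step': [1, 2, 3, 4, 5] * 2,
    'mean_time_step_count': [45380, 29544, 25773, 22503, 17336,
                             55773, 38666, 30899, 25761, 25600],
    'step_repeat': [9664, 1186, 839, 750, 1239, 12206, 1573, 581, 876, 3790],
    'step_back': [4932, 2303, 2364, 101, 0, 10621, 3414, 2283, 22, 0],
})

# Share of the records of each step that are consecutive repeats / arrivals via a step back
steps['repeat_rate'] = steps['step_repeat'] / steps['mean_time_step_count']
steps['back_rate'] = steps['step_back'] / steps['mean_time_step_count']

# Side-by-side view of the two groups
print(steps.pivot(index='step', columns='group',
                  values=['repeat_rate', 'back_rate']).round(3))
```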


# Saving files

In [20]:
# uncomment to save the results (same folder as the input files)
#df_visits_results.to_csv('../data/df_visits_results.csv')

#df_steps_results.to_csv('../data/df_steps_results.csv')